
Intelligent Data Analysis:

Developing New Methodologies


Through Pattern Discovery and
Recovery
Hsiao-Fan Wang
National Tsing Hua University, Taiwan, ROC
Hershey New York
INFORMATION SCIENCE REFERENCE
Director of Editorial Content: Kristin Klinger
Managing Development Editor: Kristin M. Roth
Senior Managing Editor: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Copy Editor: Jennifer Young
Typesetter: Chris Hrobak
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.
Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com
and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanbookstore.com
Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by
any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does
not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Intelligent data analysis : developing new methodologies through pattern discovery and recovery / Hsiao-Fan Wang, Editor.
p. cm.
Summary: This book tackles those data sets and covers a variety of issues in relation to intelligent data analysis so that patterns from
frequent or rare events in spatial or temporal spaces can be revealed. This book brings together current research, results, problems, and
applications from both theoretical and practical approaches--Provided by publisher.
ISBN-13: 978-1-59904-982-3 (hardcover)
ISBN-13: 978-1-59904-983-0 (e-book)
1. Pattern perception--Data processing. 2. Artificial intelligence. I. Wang, Hsiao-Fan.
Q327.I545 2008
003.52--dc22
2007046947
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book set is original material. The views expressed in this book are those of the authors, but not necessarily of
the publisher.
If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating
the library's complimentary electronic access to this publication.
Editorial Advisory Board
Krassimir Atanassov
Bulgarian Academy of Sciences, Bulgaria
Ivan Bruha
McMaster University, Canada
L.S. Chen
Chaoyang University of Technology, Taiwan, ROC
C.H. Chu
National Tsing Hua University, Taiwan, ROC
Alistair Forbes
Mathematics and Scientific Computing Group,
National Physical Laboratory, UK
T.K. Ho
Bell Laboratories, Lucent Technologies, USA
C.L. Hou
National Tsing Hua University, Taiwan, ROC
C.J. Huang
National Tsing Hua University, Taiwan, ROC
Falk Huettmann
University of Alaska-Fairbanks, USA
Wei-Kuo Hung
Senior Officer, Chunghwa Telecom, Taiwan, ROC
T. Kamakura
Chuo University, Japan
Arun Kulkarni
The University of Texas, USA
Chin-Yi Kuo
National Tsing Hua University, Taiwan, ROC
E. Stanley Lee
Kansas State University, USA
Yi-Chun Liao
Hsuan Chuang University, Taiwan, ROC
Valery A. Loiko
National Academy of Sciences of Belarus, Belarus
Ping Luo
Computer Science Institute, Chinese Academy of Sciences, China
Motoya Machida
Tennessee Technological University, USA
Valeri Maltsev
Institute of Chemical Kinetics and Combustion,
Russia
Trevor Martin
University of Bristol, UK
Don McNickle
University of Canterbury, NZ
Hung T. Nguyen
New Mexico State University, USA
Eric Pardede
Latrobe University, Australia
Robert M. Patton
Oak Ridge National Laboratory, USA
Dilip Kumar Pratihar
Indian Institute of Technology, India
Witold Pedrycz
University of Alberta, Canada
Heather Ruskin
Dublin City University, Ireland
Tony Smith
University of Waikato, Hamilton, New Zealand
Eulalia Szmidt
Systems Research Institute, Polish Academy of
Sciences, Poland
M.L. Wang
Ming-Hsin University of Science & Technology,
Taiwan, ROC
Sławomir T. Wierzchoń
Institute of Computer Science, Polish Academy of Sciences, Poland
W. C. Yeh
National Tsing Hua University, Taiwan, ROC
Mohammed J. Zaki
Rensselaer Polytechnic Institute, USA
Dr. Francisco Torrens Zaragoza
Institut Universitari de Ciència Molecular, Universitat de València, Edifici d'Instituts, Spain
Foreword .............................................................................................................................................xv
Preface ................................................................................................................................................xvi
Acknowledgment ................................................................................................................................ xx
Section I
Introduction
Chapter I
Automatic Intelligent Data Analysis ...................................................................................................... 1
Martin Spott, Intelligent Systems Research Centre, UK
Detlef Nauck, Intelligent Systems Research Centre, UK
Chapter II
Random Fuzzy Sets .............................................................................................................................. 18
Hung T. Nguyen, New Mexico State University, USA
Vladik Kreinovich, University of Texas at El Paso, USA
Gang Xiang, University of Texas at El Paso, USA
Chapter III
Pattern Discovery in Gene Expression Data ........................................................................................ 45
Gráinne Kerr, Dublin City University, Ireland
Heather Ruskin, Dublin City University, Ireland
Martin Crane, Dublin City University, Ireland
Chapter IV
Using Blackbox Algorithms Such as TreeNET and Random Forests for Data-Mining and for
Finding Meaningful Patterns, Relationships, and Outliers in Complex Ecological Data: An
Overview, an Example Using Golden Eagle Satellite Data and an Outlook for a Promising Future. ..... 65
Erica Craig, Western Ecological Studies, USA
Falk Huettmann, University of Alaska-Fairbanks, USA
Table of Contents
Chapter V
A New Approach to Classification of Imbalanced Classes
via Atanassov's Intuitionistic Fuzzy Sets ............................................................................................. 85
Eulalia Szmidt, Polish Academy of Sciences, Poland
Marta Kukier, Polish Academy of Sciences, Poland
Section II
Pattern Discovery from Huge Data Set: Methodologies
Chapter VI
Fuzzy Neural Networks for Knowledge Discovery ........................................................................... 103
Arun Kulkarni, The University of Texas at Tyler, USA
Sara McCaslin, The University of Texas at Tyler, USA
Chapter VII
Genetic Learning: Initialization and Representation Issues ............................................................... 120
Ivan Bruha, McMaster University, Canada
Chapter VIII
Evolutionary Computing .................................................................................................................... 131
Robert M. Patton, Oak Ridge National Laboratory, USA
Xiaohui Cui, Oak Ridge National Laboratory, USA
Yu Jiao, Oak Ridge National Laboratory, USA
Thomas E. Potok, Oak Ridge National Laboratory, USA
Chapter IX
Particle Identification Using Light Scattering: A Global Optimization Problem ............................... 143
M.C. Bartholomew-Biggs, University of Hertfordshire, Hatfield, UK
Z. Ulanowski, University of Hertfordshire, Hatfield, UK
S. Zakovic, Imperial College, UK
Chapter X
Exact Markov Chain Monte Carlo Algorithms and their Applications in
Probabilistic Data Analysis and Inference ......................................................................................... 161
Dominic Savio Lee, University of Canterbury, New Zealand
Section III
Pattern Discovery from Huge Data Set: Applications
Chapter XI
Design of Knowledge Bases for Forward and Reverse Mappings of TIG Welding Process ............. 185
J. P. Ganjigatti, Siddaganga Institute of Technology, Tumkur, India
Dilip Kumar Pratihar, Indian Institute of Technology, Kharagpur, India
Chapter XII
A Fuzzy Decision Tree Analysis of Traffic Fatalities in the U.S. ....................................................... 201
Malcolm J. Beynon, Cardiff University, UK
Chapter XIII
New Churn Prediction Strategies in Telecom Industry ...................................................................... 218
Dymitr Ruta, BT Group, Research and Venturing, UK
Christoph Adl, BT Group, Research and Venturing, UK
Detlef Nauck, BT Group, Research and Venturing, UK
Chapter XIV
Intelligent Classification and Ranking Analyses Using CaRBS: Bank Rating Application ............... 236
Malcolm J. Beynon, Cardiff University, UK
Chapter XV
Analysis of Individual Risk Attitude for Risk Management Based on Cumulative
Prospect Theory .................................................................................................................................. 254
Fei-Chen Hsu, National Tsing Hua University, Taiwan, ROC
Hsiao-Fan Wang, National Tsing Hua University, Taiwan, ROC
Section IV
Pattern Recovery from Small Data Set: Methodologies and Applications
Chapter XVI
Neural Networks and Bootstrap Methods for Regression Models with Dependent Errors ............... 272
Francesco Giordano, University of Salerno, Italy
Michele La Rocca, University of Salerno, Italy
Cira Perna, University of Salerno, Italy
Chapter XVII
Financial Crisis Modeling and Prediction with a Hilbert-EMD-Based SVM Approach ................... 286
Lean Yu, Chinese Academy of Sciences, China & City University of Hong Kong, Hong Kong
Shouyang Wang, Chinese Academy of Sciences, China
Kin Keung Lai, City University of Hong Kong, Hong Kong
Chapter XVIII
Virtual Sampling with Data Construction Method ............................................................................ 300
Chun-Jung Huang, National Tsing Hua University, Taiwan, ROC
Hsiao-Fan Wang, National Tsing Hua University, Taiwan, ROC
Compilation of References ..............................................................................................................309
About the Contributors ...................................................................................................................335
Index ................................................................................................................................................343
Foreword .............................................................................................................................................xv
Preface ................................................................................................................................................xvi
Acknowledgment ................................................................................................................................ xx
Section I
Introduction
Chapter I
Automatic Intelligent Data Analysis ...................................................................................................... 1
Martin Spott, Intelligent Systems Research Centre, UK
Detlef Nauck, Intelligent Systems Research Centre, UK
Chapter I provides a software platform for automatic data analysis that uses a fuzzy knowledge base
for automatically selecting and executing data analysis methods. The authors show that a system based
on a fuzzy pattern base that stores heuristic expert knowledge from data analysis can successfully lead
to automatic intelligent data analysis. Therefore, the system is able to support business users in running
data analysis projects more efficiently.
Chapter II
Random Fuzzy Sets ............................................................................................................................... 18
Hung T. Nguyen, New Mexico State University, USA
Vladik Kreinovich, University of Texas at El Paso, USA
Gang Xiang, University of Texas at El Paso, USA
Chapter II provides a rigorous theory of random fuzzy sets in its most general form. The focus is on imprecise
data which are both random and fuzzy. Critical issues in relation to such kinds of data with hybrid
natures are discussed, and a framework based on Probability Theory is proposed for analysis.
Detailed Table of Contents
Chapter III
Pattern Discovery in Gene Expression Data ........................................................................................ 45
Gráinne Kerr, Dublin City University, Ireland
Heather Ruskin, Dublin City University, Ireland
Martin Crane, Dublin City University, Ireland
Chapter III highlights meaningful pattern discovery techniques for gene expression data. The properties
of gene expression data themselves are examined and the possible patterns are suggested. The classes
of clustering techniques in the context of their application to gene expression data are investigated and
a comparative analysis of standard and non-standard methods is given with the suggestion of areas for
possible future development.
Chapter IV
Using Blackbox Algorithms Such as TreeNET and Random Forests for Data-Mining and for
Finding Meaningful Patterns, Relationships, and Outliers in Complex Ecological Data: An
Overview, an Example Using Golden Eagle Satellite Data and an Outlook for a Promising Future. ..... 65
Erica Craig, Western Ecological Studies, USA
Falk Huettmann, University of Alaska-Fairbanks, USA
Chapter IV describes the use of fast, data-mining algorithms such as TreeNet and Random Forests (Sal-
ford Systems Ltd) to identify ecologically meaningful patterns and relationships in subsets of data that
carry various degrees of outliers and uncertainty. An example of using satellite data from a wintering
Golden Eagle shows that the proposed approach has provided a promising tool for wildlife ecology and
conservation management.
Chapter V
A New Approach to Classification of Imbalanced Classes
via Atanassov's Intuitionistic Fuzzy Sets ............................................................................................. 85
Eulalia Szmidt, Polish Academy of Sciences, Poland
Marta Kukier, Polish Academy of Sciences, Poland
Chapter V applies Atanassov's theory of intuitionistic fuzzy sets to analyze imbalanced and overlapping
classes by defining both the membership and non-membership degrees for each member. Since imbalanced
and overlapping classes are a real challenge for standard classifiers, the method proposed in
this chapter is crucial not only in theory but also for many different types of real tasks.
Section II
Pattern Discovery from Huge Data Set: Methodologies
Chapter VI
Fuzzy Neural Networks for Knowledge Discovery ........................................................................... 103
Arun Kulkarni, The University of Texas at Tyler, USA
Sara McCaslin, The University of Texas at Tyler, USA
Chapter VI introduces fuzzy neural network models as a means for knowledge discovery from databases.
It not only describes architectures and learning algorithms for fuzzy neural networks, but also proposes
an algorithm for extracting and optimizing classification rules from a trained fuzzy neural network. An
example of multispectral satellite images is given and it shows that the presented models and the
methodology for generating classification rules from data samples provide a valuable tool for knowledge
discovery.
Chapter VII
Genetic Learning: Initialization and Representation Issues ............................................................... 120
Ivan Bruha, McMaster University, Canada
Chapter VII discusses the paradigm of genetic algorithms and their incorporation into machine learning.
Special attention is given to three issues: (a) the ways of initialization of a population for a genetic
algorithm, (b) representation of chromosomes in genetic algorithms, and (c) discretization and fuzzification
of numerical attributes for genetic algorithms. Furthermore, this chapter surveys new trends of dealing
with the variable-length chromosomes and other issues related to the genetic learners.
Chapter VIII
Evolutionary Computing .................................................................................................................... 131
Robert M. Patton, Oak Ridge National Laboratory, USA
Xiaohui Cui, Oak Ridge National Laboratory, USA
Yu Jiao, Oak Ridge National Laboratory, USA
Thomas E. Potok, Oak Ridge National Laboratory, USA
Chapter VIII introduces evolutionary computing as a whole and discusses in detail two sub-areas of
nature-inspired computing, namely, evolutionary algorithms and swarm intelligence. The theoretical
background of these sub-areas is illustrated with demonstrations of some real-world applications. The
chapter also points out future trends and directions in these areas.
Chapter IX
Particle Identifcation Using Light Scattering: A Global Optimization Problem ............................... 143
M.C. Bartholomew-Biggs, University of Hertfordshire, Hatfeld, UK
Z. Ulanowski, University of Hertfordshire, Hatfeld, UK
S. Zakovic, Imperial College, UK
Chapter IX proposes two composite approaches which combine conventional data fitting with peak-matching
to cope with noisy data in solving an inverse light scattering problem for single, spherical,
homogeneous particles using least squares global optimization, and shows that they lead to a more robust
identification procedure.
Chapter X
Exact Markov Chain Monte Carlo Algorithms and their Applications in
Probabilistic Data Analysis and Inference ......................................................................................... 161
Dominic Savio Lee, University of Canterbury, New Zealand
Chapter X introduces an approach called Markov chain Monte Carlo for the exact simulation of sample
values from complex distributions. The proposed algorithm facilitates the implementation of a Markov
chain that has a given distribution as its stationary distribution. The applications of these algorithms in
probabilistic data analysis and inference are given.
Section III
Pattern Discovery from Huge Data Set: Applications
Chapter XI
Design of Knowledge Bases for Forward and Reverse Mappings of TIG Welding Process ............. 185
J. P. Ganjigatti, Siddaganga Institute of Technology, Tumkur, India
Dilip Kumar Pratihar, Indian Institute of Technology, Kharagpur, India
Chapter XI provides suitable knowledge bases (KBs) for carrying out forward and reverse mappings of
the Tungsten Inert Gas (TIG) welding process. Both the forward as well as reverse mappings are required
for an effective on-line control of a process. Although conventional statistical regression analysis is able
to carry out the forward mapping efficiently, it may not always be able to solve the problem of reverse
mapping. Fuzzy logic (FL)-based approaches are adopted to conduct the forward and reverse mappings
of the TIG welding process, and they are shown to solve the above problem efficiently.
Chapter XII
A Fuzzy Decision Tree Analysis of Traffic Fatalities in the U.S. ....................................................... 201
Malcolm J. Beynon, Cardiff University, UK
Chapter XII concerns a problem of road travel in the US, namely the discernment of the levels of traffic
fatalities across the individual states. Based on the cognitive uncertainties evident in the imprecision
inherent with the data values, a fuzzy approach to decision tree is adopted for inference. The results
show that the inference from the tree structure takes advantage of the ability of humans to distinguish
between patterns and observable characteristics.
Chapter XIII
New Churn Prediction Strategies in Telecom Industry ...................................................................... 218
Dymitr Ruta, BT Group, Research and Venturing, UK
Christoph Adl, BT Group, Research and Venturing, UK
Detlef Nauck, BT Group, Research and Venturing, UK
Chapter XIII provides a method to resolve the major problem of time discontinuity resulting from the
transactional character of events in the telecom market. By gradually enriching the data information content
from the prior lifetime expectancy through standard static events data up to decay-weighted data
sequences, the proposed sequential processing of appropriately preprocessed data streams is shown to
achieve better performance in customer churn prediction.
Chapter XIV
Intelligent Classification and Ranking Analyses Using CaRBS: Bank Rating Application ............... 236
Malcolm J. Beynon, Cardiff University, UK
Chapter XIV applies Dempster-Shafer Theory to object classification and ranking. Based on this theory,
a method called CaRBS is proposed and an application to cope with uncertain reasoning on Moody's
Bank Financial Strength Rating (BFSR) process is demonstrated. The value of this chapter is placed on
the measures of ignorance such that during a series of classification and ranking analyses, decision on
adopting or abandoning the existing evidence can be determined.
Chapter XV
Analysis of Individual Risk Attitude for Risk Management Based on Cumulative
Prospect Theory .................................................................................................................................. 254
Fei-Chen Hsu, National Tsing Hua University, Taiwan, ROC
Hsiao-Fan Wang, National Tsing Hua University, Taiwan, ROC
Chapter XV illustrates how to describe the individual's preference structure and utilize its properties to
define an individual's risk level for the confronted risk. Then, a response evaluation model is proposed
to develop the appropriate response strategy. These two stages of risk analysis and risk response contribute
to a complete individual risk management process (IRM). A case of A-C court is demonstrated,
and the results show that the proposed method is able to provide more useful and pertinent information
than the traditional decision tree method, which is based on the expected monetary value (EMV).
Section IV
Pattern Recovery from Small Data Set: Methodologies and Applications
Chapter XVI
Neural Networks and Bootstrap Methods for Regression Models with Dependent Errors ............... 272
Francesco Giordano, University of Salerno, Italy
Michele La Rocca, University of Salerno, Italy
Cira Perna, University of Salerno, Italy
Chapter XVI introduces the use of the bootstrap in a nonlinear, nonparametric regression framework
with dependent errors. The AR-Sieve bootstrap and the Moving Block bootstrap, which generate bootstrap
replicates with a proper dependence structure, are used to avoid the inconsistency inherent in the
conventional bootstrap method. In the framework of neural network models, which are often used as an
accurate nonparametric estimation and prediction tool, both procedures are shown to give satisfactory results.
Chapter XVII
Financial Crisis Modeling and Prediction with a Hilbert-EMD-Based SVM Approach ................... 286
Lean Yu, Chinese Academy of Sciences, China & City University of Hong Kong, Hong Kong
Shouyang Wang, Chinese Academy of Sciences, China
Kin Keung Lai, City University of Hong Kong, Hong Kong
Chapter XVII proposes a methodology based on a Hilbert-EMD-based support vector machine (SVM) to
predict financial crisis events for early-warning purposes. A typical financial indicator, the currency exchange
rate, reflecting economic fluctuation conditions, is first chosen. Then the Hilbert-EMD algorithm is applied
to the economic indicator series. This chapter also applies the proposed method to two real-world cases,
South Korea and Thailand, which suffered from the disastrous 1997-1998 financial crisis. The results
show that the proposed Hilbert-EMD-based SVM methodology is capable of predicting financial crisis
events effectively.
Chapter XVIII
Virtual Sampling with Data Construction Method ............................................................................ 300
Chun-Jung Huang, National Tsing Hua University, Taiwan, ROC
Hsiao-Fan Wang, National Tsing Hua University, Taiwan, ROC
Chapter XVIII proposes an alternative approach, the Data Construction Method (DCM), to overcome
the problems derived from insufficient data, in particular the defects of one commonly used approach
called the Intervalized Kernel method of Density Estimation (IKDE). Comparative studies have shown
that the proposed method not only resolves the insufficient-data problem in general, but also improves
the prediction accuracy, in both degree and stability, of IKDE.
Compilation of References ..............................................................................................................309
About the Contributors ...................................................................................................................335
Index ................................................................................................................................................343
Foreword
Professor Hsiao-Fan Wang has prepared a technically excellent and timely book on a new, important,
and rapidly developing area of research and applications: Intelligent Data Analysis: Developing New
Methodologies through Pattern Discovery and Recovery.
This book covers two categories of non-classical data analysis: data mining and data construction.
It is perhaps the first book that provides comprehensive methodologies and real case applications
towards huge data warehouses and small data nests. In particular, the book takes the viewpoint of pattern
recognition and demonstrates that the tools developed can be applied to studies in learning and decision
making of general human activities as a whole.
Professor Hsiao-Fan Wang should be congratulated for an outstanding job on this edited version of the
book. Her vision on placing the theme on this rapidly developed yet very basic agenda of data analysis
is valuable and significant. Professor Wang's balanced approach between theories and applications is a
reflection of her own extensive research experience. Her way to tackle small data samples demonstrates
very well her deep understanding of the needs and the trends of development in the real engineering
world. While data mining has become a mature research area, developing data construction techniques
for such rare events will undoubtedly draw more and more research attention in the future.
As we are entering an era in which machine intelligence and artificial intelligence will play significant
roles in the design of intelligent and complex systems, the problem of recognizing patterns from either
huge or small data sets will continue to challenge researchers and practitioners. And I am sure this
book will play an important role in motivating and leading further development and applications in the
area of intelligent data analysis.

Chung Laung Liu
Honorary Chair Professor
National Tsing Hua University
C.L. Liu received his advanced degrees from the Massachusetts Institute of Technology. He taught at MIT, the
University of Illinois at Urbana-Champaign, and National Tsing Hua University. From 1996 to 1998, he served
as associate provost at UIUC. From 1988 to 2002, he was president of National Tsing Hua University (NTHU).
Dr. Liu is a member of Academia Sinica, and also, fellow of IEEE and ACM. After his term as president of NTHU,
Dr. Liu continues his teaching and research activities. He also serves as consultant to high tech companies, works
for a charitable foundation in Hong Kong, and, in the last two years, hosts a weekly radio show on technology
and humanities in the radio station IC975 in Hsinchu, Taiwan, ROC.
Preface
Intelligent Data Analysis: Developing New Methodologies Through Pattern Discovery and Recovery
provides learning tools for finding data patterns based on artificial intelligence. Pattern recognition has a
long history of applications to data analysis in business, military, and socioeconomic activities. While the
aim of pattern recognition is to discover the pattern of a data set, the size of the data set is closely related
to the methodology one adopts for analysis. The classic approach uses statistical techniques
to deal with data sets of more than 30 samples and reveals the pattern by dimension reduction. With
the rapid increase of Internet development and usage, the amount of data has become enormous. The term
Data Warehouse has been used to describe such quantities of data and the corresponding methodolo-
gies for analysis are under the title of data mining.
In contrast to such huge data sets, there is another type of data set which is small (fewer than
30 samples), but still significant in terms of socioeconomic cost. Consider severe earthquakes, random terrorist
attacks, and nuclear plant explosions; the occurrences of such events are so rare that the conventional
statistical assumptions cannot be verified and thus the methods fail to apply. The ability to predict
such kinds of events remains a challenge for researchers. This leads to the necessity of recovering
a pattern by constructing data.
Apart from these two extreme cases related to the amount of data, which affects the method of analysis
to be adopted, the type of the data is another major factor that needs to be considered. Since in reality the
collected data are never complete, a certain degree of uncertainty is always embedded. The classical approach
to coping with uncertainty is based on Probability Theory and assumes a random nature. As different
methodologies and observations have been investigated, data types other than random ones have been studied
and explored. Among these, fuzzy data, grey data, and coarse data, with their hybrid forms, are studied most
extensively. The results pave the way to finding data patterns, from binary groupings to degrees of belonging,
in a more accurate and precise manner.
For all of these data types in quantity and quality, apart from Probability Inference being adopted for
analysis, a group of heuristic approaches, namely soft computing (or computational intelligence), has
been developed and employed for different areas of applications. Fuzzy logic, evolutionary computing,
neural net analysis, and so on have shown their capability in coping with such kinds of data sets. It is
an art and science of intelligent data analysis.
Since pattern recognition has been a learning process ever since living beings began, classical ap-
proaches to classifying data into binary groups have been enormous in the literature. Due to the increasing
impact of extreme events on the socio-economic costs, 38 authors from 10 different countries contributed
their findings to the 18 chapters in this book, each addressing different issues of intelligent pattern discovery
and recovery from both theoretical and practical viewpoints. The readers will benefit from the integration
of these two extreme cases in a comparative manner.
The book is categorized into four sections. After an introduction of the up-to-date development and
research on methodologies and data properties in Section I, issues and resolutions of pattern discovery
from huge data sets are discussed and applied in Sections II and III, respectively. Finally, in Section IV,
methodology developments and possible applications of pattern recovery from small data sets are
presented. As can be noted from the unbalanced numbers of chapters related to huge data sets and small
data sets, methods for pattern recovery from small data sets require the devotion of more researchers.
The outline of each chapter is given below.
In Section I, five chapters are included, as outlined below:
Chapter I provides a software platform for automatic data analysis that uses a fuzzy knowledge base
for automatically selecting and executing data analysis methods. The authors show that a system based
on a fuzzy pattern base that stores heuristic expert knowledge from data analysis can successfully lead
to automatic intelligent data analysis. Therefore, the system is able to support business users in running
data analysis projects more efficiently.
Chapter II provides a rigorous theory of random fuzzy sets in its most general form. The focus is on
imprecise data which are both random and fuzzy. Critical issues in relation to such kinds of data with
hybrid natures are discussed and a framework based on Probability Theory is proposed for analysis.
Chapter III highlights meaningful pattern discovery techniques for gene expression data. The proper-
ties of gene expression data themselves are examined and the possible patterns are suggested. The classes
of clustering techniques in the context of their application to gene expression data are investigated and
a comparative analysis of standard and non-standard methods is given with the suggestion of areas for
possible future development.
Chapter IV describes the use of fast, data-mining algorithms such as TreeNet and Random Forests
(Salford Systems Ltd) to identify ecologically meaningful patterns and relationships in subsets of data
that carry various degrees of outliers and uncertainty. An example of using satellite data from a winter-
ing golden eagle shows that the proposed approach has provided a promising tool for wildlife ecology
and conservation management.
Chapter V applies Atanassov's theory of intuitionistic fuzzy sets to analyze imbalanced and overlapping
classes by defining both the membership and non-membership degrees for each member. Since
imbalanced and overlapping classes are a real challenge for standard classifiers, the method proposed
in this chapter is crucial not only in theory but also for many different types of real tasks.
Section II of methodologies regarding pattern discovery from huge data sets contains five chapters;
each is introduced as below:
Chapter VI introduces fuzzy neural network models as a means for knowledge discovery from da-
tabases. It not only describes architectures and learning algorithms for fuzzy neural networks, but also
proposes an algorithm for extracting and optimizing classification rules from a trained fuzzy neural
network. An example of multispectral satellite images is given and it shows that the presented models
and the methodology for generating classification rules from data samples provide a valuable tool for
knowledge discovery.
Chapter VII discusses the paradigm of genetic algorithms and their incorporation into machine
learning. Special attention is given to three issues: (a) the ways of initialization of a population for a
genetic algorithm, (b) representation of chromosomes in genetic algorithms, and (c) discretization and
fuzzification of numerical attributes for genetic algorithms. Furthermore, this chapter surveys new trends
of dealing with the variable-length chromosomes and other issues related to the genetic learners.
Chapter VIII introduces evolutionary computing as a whole and discusses in detail two sub-areas of
nature-inspired computing, namely, evolutionary algorithms and swarm intelligence. The theoretical
background of these sub-areas is illustrated with demonstrations of some real-world applications. The
chapter also points out future trends and directions in these areas.
Chapter IX proposes two composite approaches which combine conventional data fitting with peak-matching
to cope with noisy data in solving an inverse light scattering problem for single, spherical,
homogeneous particles using least squares global optimization, and shows that they lead to a more robust
identification procedure.
Chapter X introduces an approach called Markov chain Monte Carlo for the exact simulation of sample
values from complex distributions. The proposed algorithm facilitates the implementation of a Markov
chain that has a given distribution as its stationary distribution. The applications of these algorithms in
probabilistic data analysis and inference are given.
Section III, on the applications of pattern discovery from huge data sets, contains five cases from different
industrial sectors of manufacturing, transportation, and services:
Chapter XI provides suitable knowledge bases (KBs) for carrying out forward and reverse mappings of
the Tungsten Inert Gas (TIG) welding process. Both the forward as well as reverse mappings are required
for an effective online control of a process. Although conventional statistical regression analysis is able
to carry out the forward mapping efficiently, it may not always be able to solve the problem of reverse
mapping. Fuzzy logic (FL)-based approaches are adopted to conduct the forward and reverse mappings
of the TIG welding process, and they are shown to solve the above problem efficiently.
Chapter XII concerns a problem of road travel in the US, namely the discernment of the levels of
traffic fatalities across the individual states. Based on the cognitive uncertainties evident in the impreci-
sion inherent with the data values, a fuzzy approach to decision tree is adopted for inference. The results
show that the inference from the tree structure takes advantage of the ability of humans to distinguish
between patterns and observable characteristics.
Chapter XIII provides a method to resolve the major problem of time discontinuity resulting from
the transactional character of events in the telecom market. By gradually enriching the data information
content from the prior lifetime expectancy through standard static events data up to decay-weighted data
sequences, the proposed sequential processing of appropriately preprocessed data streams is shown to
achieve better performance in customer churn prediction.
Chapter XIV applies Dempster-Shafer Theory to object classification and ranking. Based on this
theory, a method called CaRBS is proposed and an application to cope with uncertain reasoning on
Moody's Bank Financial Strength Rating (BFSR) process is demonstrated. The value of this chapter
is placed on the measures of ignorance such that during a series of classification and ranking analyses,
decision on adopting or abandoning the existing evidence can be determined.
Chapter XV illustrates how to describe the individual's preference structure and utilize its properties to
define an individual's risk level for the confronted risk. Then, a response evaluation model is proposed
to develop the appropriate response strategy. These two stages of risk analysis and risk response contribute
to a complete individual risk management process (IRM). A case of A-C court is demonstrated,
and the results show that the proposed method is able to provide more useful and pertinent information
than the traditional decision tree method, which is based on the expected monetary value (EMV).
Section IV contains three chapters of current methodologies developed for analyzing small sample
sets with illustrations of their applications.
Chapter XVI introduces the use of the bootstrap in a nonlinear, nonparametric regression framework
with dependent errors. The AR-Sieve bootstrap and the Moving Block bootstrap, which generate bootstrap
replicates with a proper dependence structure, are used to avoid the inconsistency inherent in the
conventional bootstrap method. In the framework of neural network models, which are often used as an
accurate nonparametric estimation and prediction tool, both procedures are shown to give satisfactory
results.
Chapter XVII proposes a methodology based on a Hilbert-EMD-based support vector machine (SVM) to
predict financial crisis events for early-warning purposes. A typical financial indicator, the currency exchange
rate, reflecting economic fluctuation conditions, is first chosen. Then the Hilbert-EMD algorithm is applied
to the economic indicator series. This chapter also applies the proposed method to two real-world cases,
South Korea and Thailand, which suffered from the disastrous 1997-1998 financial crisis. The results
show that the proposed Hilbert-EMD-based SVM methodology is capable of predicting financial
crisis events effectively.
Chapter XVIII proposes an alternative approach named the Data Construction Method (DCM) to overcome
the problems derived from insufficient data; in particular, the defects existent in one commonly
used approach called the Intervalized Kernel method of Density Estimation (IKDE). Comparative studies
have shown that the proposed DCM not only resolves the insufficient-data problem in general, but also
improves the prediction accuracy, in both degree and stability, of IKDE.
From the content described above, it can be noted that this book will be useful for both researchers
and practitioners who are interested in receiving comprehensive views and insights from the variety of
issues covered in this book in relation to pattern discovery and recovery. In particular, those who have
been working on data analysis will have an overall picture of the existing and potential developments
on the issues related to intelligent pattern recognition.
Hsiao-Fan Wang
National Tsing Hua University
Acknowledgment
The editor would like to acknowledge the help of all involved in the collation and review process of
the book, without whose support the project could not have been satisfactorily completed. Deep appreciation
and gratitude are due to two organizations, the Department of Management, University of
Canterbury, NZ, and the National Science Council, Taiwan, ROC, for their facilities and financial support
during my year as a visiting professor.
Deep gratitude is first sent to Dr. Chung Laung Liu, the former president of National Tsing Hua Uni-
versity, for his Foreword, full of encouragement and kind support. We also would like to thank all authors
for their excellent contributions to this volume. In particular, most of the authors of chapters included
in this book also served as referees for chapters written by other authors. Thanks go to all those who
provided comprehensive reviews and constructive comments. Special thanks also go to the publishing
team at IGI Global, in particular to Ross Miller and Jessica Thompson, who continuously prodded via
e-mail to keep the project on schedule and to Mehdi Khosrow-Pour, whose enthusiasm motivated me
to initially accept his invitation for taking on this project.
Special thanks go to National Tsing Hua University and the colleagues of the Department of Indus-
trial Engineering and Engineering Management in Taiwan; without their understanding and support,
this volume would not have been possible.
Finally, I wish to thank my boys, I-Fan (Daniel) and Tao-Fan (Ray), for their understanding and
immense love during this project.

Editor,
Hsiao-Fan Wang
Tsing Hua Chair Professor
Taiwan, Republic of China
October 2007
Section I
Introduction
Chapter I
Automatic Intelligent Data
Analysis
Martin Spott
Intelligent Systems Research Centre, UK
Detlef Nauck
Intelligent Systems Research Centre, UK
ABSTRACT
This chapter introduces a new way of using soft constraints for selecting data analysis methods that match
certain user requirements. It presents a software platform for automatic data analysis that uses a fuzzy
knowledge base for automatically selecting and executing data analysis methods. In order to support
business users in running data analysis projects the analytical process must be automated as much as
possible. The authors argue that previous approaches based on the formalisation of analytical processes
were less successful because selecting and running analytical methods is very much an experience-led
heuristic process. The authors show that a system based on a fuzzy knowledge base that stores heuristic
expert knowledge about data analysis can successfully lead to automatic intelligent data analysis.
INTRODUCTION
Data is one of the most valuable assets of today's
businesses and timely and accurate analysis of
available data is essential for making the right
decisions and competing in today's ever-changing
business environment. Most businesses today face
the analytics challenge and build ever-growing
data warehouses but face a lack in analytical com-
petence and resources. Data owners typically are
domain experts who understand the processes that
generated the data and what the data represents.
However, data owners typically are not at the
same time expert analysts and struggle with the
application of advanced analytics. Passing data
on to an analyst results in a communication chal-
lenge that requires the domain expert to explain
the data context and generate a problem statement
that the analyst can use as the basis for analysing
the data. When that has been done the analyst has
to present the results in a way that the data owner
can relate them to his context and derive valuable
information for future decision making.
What then follows is an application challenge
that requires IT support staff to turn analytical
results into software that can be integrated into
operational systems. With the introduction of
data analysis functionality in databases and a
standardised language for model descriptions like
PMML (Predictive Model Markup Language)
defined by the Data Mining Group (www.dmg.org),
the integration may become simpler in the
future. Under the assumption that the analysis
tool is able to create a PMML description for the
model in question and the database implements
the underlying analysis algorithm, the PMML
description can simply be included in a database
script (e.g., PL/SQL for Oracle databases) that will
be used to analyse data in the operational system.
However, it still will take many years before data
analysis is standard in databases and a large variety
of models can be transferred in that way.
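To make the PMML route concrete, here is a minimal sketch in Python, assuming a deliberately simplified PMML fragment for a linear regression model (a complete PMML document would also carry a data dictionary and a mining schema). The field names and coefficients are invented for illustration, and the snippet is not tied to SPIDA or to any particular database.

import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML fragment for a linear regression model.
PMML = """
<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="call_minutes" coefficient="0.02"/>
      <NumericPredictor name="contract_months" coefficient="-0.10"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-3_2"}

def score(pmml_text, record):
    """Apply the regression table described in a PMML document to one record."""
    table = ET.fromstring(pmml_text).find(".//pmml:RegressionTable", NS)
    value = float(table.get("intercept", "0"))
    for predictor in table.findall("pmml:NumericPredictor", NS):
        value += float(predictor.get("coefficient")) * record[predictor.get("name")]
    return value

print(score(PMML, {"call_minutes": 120.0, "contract_months": 12}))   # 2.7

In the scenario described above, a database script generated from the PMML description would play the role of score(), so that the model produced by the analysis tool can be executed inside the operational system.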
Commercial data analysis software that is
aimed at a business context either is too simplis-
tic and the manufacturer has decided to provide
only limited functionality that non-expert users
can handle or the software is too complex and
provides advanced functionality that is aimed
directly at expert analysts. In order to overcome
both the analytics challenge and the communica-
tion challenge, tools are required that empower
domain experts and data owners to run advanced
analytics themselves with as little help from ana-
lysts as possible. One approach is to hide complex
analytics under a layer of automation that provides
an interface that allows users to work goal-oriented
instead of method-oriented.
In this chapter we describe an approach to au-
tomating data analysis based on a fuzzy matching
approach between user requirements and features
of analytical models. We first discuss some general
issues around automating analytics and then we
present a software system that implements our
approach.
AUTOMATING DATA ANALYSIS
When we talk about data analysis in this chapter
we refer to the task of discovering a relationship
between a number of attributes and representing
this relationship in the form of a model. Typically, we
are interested in determining the value of some
attributes given some other attributes (inference)
or in finding groups of attribute-value combina-
tions (segmentation). In this context we will
not consider describing parameters of attribute
distributions or visualisation.
Models are typically used to support a deci-
sion making process by inferring or predicting
the (currently unknown) values of some output
attributes given some input attributes or by deter-
mining a group to which the currently observed
data record possibly belongs. In this scenario
we expect a model to be as accurate as possible.
Models also can be used to explain a relationship
between attributes. In this scenario we want a
model to be interpretable.
A model is created in a (machine) learning
process, where the parameters of the model are
adapted based on a set of training data. The learning
process can be controlled by a separate validation
set to prevent over-generalisation on the training
set. The model performance is finally tested on a
different test set.
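As a small, generic illustration of the three data sets just mentioned (it is not part of SPIDA, and the 60/20/20 proportions are an arbitrary choice), the sketch below splits a collection of records into training, validation and test partitions.

import random

def split_data(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle the records and split them into training, validation and test sets.

    The training set is used to adapt the model parameters, the validation set
    to stop learning before over-generalisation sets in, and the test set to
    measure the final performance on data the model has never seen."""
    rows = list(records)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]

train_set, valid_set, test_set = split_data(range(100))
print(len(train_set), len(valid_set), len(test_set))   # 60 20 20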
In business environments data and problem
owners are typically domain experts, but not data
analysis experts. That means they do not have
the required knowledge to decide which type of
model and learning algorithm to choose, how
to set the parameters of the learning procedure,
how to adequately test the learned model, and
so on. In order to support this group of users, we
have developed an approach for automating data
analysis to some extent.
Previous approaches towards automating data
analysis or knowledge discovery in databases
were based on AI techniques. Analysis methods
were broken down into formal blocks and user
requirements were also represented in a formal
language. Then a search algorithm would iden-
tify suitable blocks and arrange them in a way
to carry out an analysis process (Wirth, Shearer,
Grimmer, Reinartz, Schloesser, Breitner, Engels,
& Lindner 1997). These approaches had to face
the problem of formalising mainly heuristic methods
and the fact that it is usually not feasible to formally
compute all necessary parameters to execute an
analysis method. Other authors discussed mainly
architectural features of systems that could au-
tomate data analysis or data mining and avoided
discussing how to automatically select and ex-
ecute analysis methods (Botia, Velasco, Garijo,
& Skarmeta, 1998; Botia, Skarmeta, Velasco, &
Garijo 2000).
A more recent approach and the system closest
to our approach regarding the required capabili-
ties is the one described by Bernstein and Provost
(2001) and Bernstein, Hill, and Provost (2002).
Their system uses an ontology-based approach and
simply describes analysis methods and pre-/post-
processing methods as input/output blocks with
specific interfaces. The system is built on top of
the data analysis package Weka (Witten & Frank,
2000). If the interfaces between two blocks match,
they can be concatenated in an analysis process.
If a user wants to analyse a data set, all possible
analysis processes are created and executed. Once
a suitable analysis process has been identified, it
can be stored, re-used, and shared. The authors
suggest a heuristic ranking of analysis processes
in order to execute only the best processes. How-
ever, they only use speed as a ranking criterion,
which can be easily determined as a feature of
an algorithm. More useful features about the
quality of the analysis like accuracy obviously
are dependent on the analysis process as well as
the analysed data and are much more difficult to
determine up front. Therefore, the reported results
have not been very encouraging so far.
In this chapter we describe a different approach
that we have followed. We have used a fuzzy
knowledge base that maps user requirements onto
properties of analysis methods by using a fuzzy
compatibility measure (Spott & Nauck, 2005).
Each requirement is represented by a fuzzy vari-
able that assumes weighted combinations of fuzzy
words as values (Spott, 2001, 2005). Rather than
modelling the weights as probabilities as Spott
(2001, 2005) describes, we assume a possibility
density function (Gebhard & Kruse, 1993; Spott,
2001) on the fuzzy words, which allows for alter-
native values which, as an extreme case, could
all be possible without restriction. This allows a
user to easily specify alternatives he is equally
happy to accept as the outcome of a modelling
process.
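The matching idea can be illustrated with a deliberately simplified sketch. It is not the SPIDA knowledge base and not the compatibility measure of Spott and Nauck (2005); it collapses each requirement to a single demanded degree per fuzzy word, and all method properties and numbers are invented.

# Heuristic degrees (0..1) to which each method exhibits a fuzzy word;
# the values are invented for illustration only.
METHOD_PROPERTIES = {
    "decision tree":  {"interpretable": 1.0, "accurate": 0.7, "adaptive": 0.3},
    "neural network": {"interpretable": 0.1, "accurate": 0.9, "adaptive": 0.8},
}

# User requirements: the degree to which each property is demanded.
USER_REQUIREMENTS = {"interpretable": 1.0, "accurate": 0.8, "adaptive": 0.5}

def suitability(method_props, requirements):
    """Toy compatibility score: a method is penalised for every property it
    offers to a lesser degree than the user demands; the weakest match wins."""
    scores = [min(1.0, 1.0 - demanded + method_props.get(word, 0.0))
              for word, demanded in requirements.items()]
    return min(scores)

for name, props in METHOD_PROPERTIES.items():
    print(name, round(suitability(props, USER_REQUIREMENTS), 2))
# decision tree 0.8, neural network 0.1: the interpretability requirement dominates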
When conducting any kind of data analysis
and planning to use the results in an application
we can assume to have the following.
1. A problem definition: What is the problem
we would like to solve? Do we want to pre-
dict or classify, find groups (segmentation,
clustering), rules or dependencies, and so
on?
2. Preferences regarding the solution: whether
the model can be adapted to new data, whether
it is easy to understand (rule-based, simple
functions), its accuracy, execution time, and
so on.
Experts in data analysis are well aware of
these analysis problems and preferences. In the
following, we assume that the data has already
been prepared for analysis, that is, it has been
compiled from different data sources, if neces-
sary, and formatted in a way that it can be used
by standard data analysis tools. This process is
mostly done with ETL tools (Extract, Transform,
and Load). An expert would then perform the
following steps.
1. Choose analysis algorithms which are suited
for the problem and the given preferences.
2. If required, pre-process the data to make it
suitable for the selected algorithms.
3. Configure the training parameters of the
algorithms.
4. Run the algorithms, check the performance
of the created models and compare against
the preferences again.
5. If further improvement seems necessary
and possible, go back to Step 3.
6. Implement the model in an executable
form.
7. Integrate the executable model into an op-
erational system.
In the following section we introduce the
software environment SPIDA and illustrate how
it covers Steps 1-5. Then we will describe the
theoretical concepts SPIDA uses to match user
preferences to attributes of data analysis methods.
The final two steps of the above list are discussed
after that.
SPIDA
We have developed a research prototype of an
automatic intelligent data analysis platform that
uses a fuzzy knowledge base to encode expert
knowledge about data analysis algorithms.
SPIDA (Soft Computing Platform for Intelligent
Data Analysis; Nauck, Spott, & Azvine, 2003)
is a Java based client/server application that is
capable of:
Automatically creating a data analysis model
based on user requirements,
Wrapping a learned model into an execut-
able analytical module for integration with
operational systems, and
Monitoring the performance of an operational
system and triggering the generation of
a new analytical module in case of performance
deterioration.
Figure 1. The SPIDA Wizard allows users to specify several high-level preferences for the outcome of
the analysis process. Preferences for an explanation facility, for example, are used to select appropriate
data analysis algorithms while preferences for accuracy and simplicity can only be used after models
have been built.
SPIDA provides a wizard interface that guides
non-expert users through the data analysis pro-
cess. The process starts with identifying a data
source which can be loaded from an ASCII file,
a spreadsheet, or a database connection and com-
pletes with an analytical model that can be used
to process data in the same format as the initially
selected data source.
The user begins with stating the purpose of the
analysis. The current version supports prediction
(i.e., classification and regression) and finding
groups (i.e., clustering). After identifying a data
source SPIDA loads the data and automatically
identifes the type of variables present in the data.
The user can override the type selection and as-
sign any variable as the target of the subsequent
analysis. In the next few steps the user can specify
some high-level preferences for the outcome of
the analysis process (Figure 1). For example, the
user can specify if the model should be inter-
pretable, that is, it should be possible to explain
how a certain output value was computed. It is
possible to specify the simplicity of the explana-
tion and the type of explanation (rules or equa-
tions). Simplicity is measured by the number of
parameters in the model. For a rule-based model,
for example, a model with, say, 10 rules using
two to three variables each would be considered
simpler than a model with eight rules using five
variables each.
The fact that the user prefers an interpretable
model is used in the selection process for analy-
sis methods. A neural network, for example, is
considered not to be interpretable, while a deci-
sion tree is.
Other preferences like accuracy and simplicity
of the model can only be evaluated after models
have been built and can only be used in selecting
the best model.
After the user has selected the preferences
important to him SPIDA can automatically decide
which analysis methods can be applied to the data
that has been loaded and how well each method
matches those user preferences that can be com-
pared against method features. SPIDA uses a fuzzy
knowledge base for this purpose. The knowledge
base encodes heuristic expert knowledge about
the suitability of analysis methods. For each
method certain features are defined, for example,
what kind of data it can process, if it produces an
interpretable model, if it can adapt to new data,
and so on. The knowledge base compares these
features against the specifications of the loaded
data and the selected user preferences and assigns
a suitability score to each applicable method (Fig-
ure 2). The user can now let SPIDA automatically
create models based on all identified methods or
intervene and execute only a subset.
All selected methods are then automatically
configured, executed and evaluated. This is done
for all methods in parallel and possibly iteratively
if the model evaluation reveals that the results
are not yet satisfactory and do not match the user
requirements sufficiently.
In the current version SPIDA supports the following data analysis methods: neural networks (classification and regression), neuro-fuzzy systems (classification and regression), support vector machines (classification and regression), decision trees (classification), linear regression, and fuzzy cluster analysis (finding groups). In addition SPIDA provides a large number of data filters for treatment of missing values, re-coding, scaling and normalisation, variable selection, record selection (e.g., for splitting data into training and test sets for cross-validation), and so on.

Figure 2. Based on the type of data, the required analysis type (here: classification) and the user preferences, SPIDA selects suitable analysis methods and ranks their suitability.
In order to execute a particular analysis
method, SPIDA automatically pre-processes or
re-codes the data in order to convert it to a format
the analysis method can process. For example, if
the method cannot handle missing values those
can be imputed automatically based on differ-
ent strategies. If the method can only handle
numeric data, but symbolic data is present this
can be re-encoded numerically. This is done, for
example, for neural networks. Created models are
evaluated automatically against test data that has
been withheld from the model training process.
The model performance and other features like
simplicity are then compared against the user
preferences again based on the rules encoded in
the SPIDA knowledge base. If SPIDA concludes
that the match with the user preferences is in-
sufficient it uses heuristics from the knowledge
base to change the parameters controlling the
model creation process. For example, if a neural
network is not providing good enough accuracy,
SPIDA may decide to repeat the training process
using more hidden neurons. A decision tree, for
example, could be re-built with stronger pruning
parameters if it turns out that it is too complex
for the required level of simplicity.
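As a purely illustrative example of such automatic re-coding (not SPIDA's actual filter code), mean imputation of missing values and a one-out-of-n encoding of a symbolic column could look like this:

    # Hypothetical sketch of two simple pre-processing filters (not SPIDA's actual code):
    # mean imputation of missing values and one-out-of-n encoding of a symbolic column.
    def impute_mean(values):
        known = [v for v in values if v is not None]
        mean = sum(known) / len(known)
        return [mean if v is None else v for v in values]

    def one_hot(labels):
        classes = sorted(set(labels))
        return [[1.0 if label == c else 0.0 for c in classes] for label in labels]

    print(impute_mean([1.0, None, 3.0]))             # [1.0, 2.0, 3.0]
    print(one_hot(["glass1", "glass2", "glass1"]))   # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]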
In Figure 3 the outcome of an analytical process
on the glass data set is shown. The glass data set is
a forensic data set of 214 records about six types of
glass described by nine attributes. The data set is
available at the UCI Machine Learning Repository
(UCI, 2007). By default SPIDA uses 50 percent
of the data for model building and 50 percent for
testing. For this example, we requested a simple
model, that is, the balance between accuracy and
simplicity was skewed towards simplicity (Figure
1). SPIDA decided to build a neural network, a
neuro-fuzzy classifier based on the NEFCLASS
algorithm (Nauck, 2003), a decision tree and a
support vector machine. The most accurate model
found is the neural network with 63.11 percent
accuracy on the test data. However, SPIDA ranks
its suitability as low, because the specified user
requirements in this example requested a simple
interpretable model, preferably based on rules.
A neural network is not interpretable in terms of
rules. The most suitable model in this example is
the neuro-fuzzy model although its accuracy is
much lower. However, it uses only six rules with
altogether 15 terms to do the job.
SPIDA has attempted to find better solutions both for the decision tree and the neuro-fuzzy classifier by modifying learning parameters. The detail views in Figure 3 reveal that a decision tree with 101 terms would yield an accuracy of 50 percent but was not recommended, because a tree with the lower accuracy of 45.03 percent required only 28 terms. A similar result can be seen for the neuro-fuzzy model. The highest accuracy is achieved with a model using 28 terms. However, because of the requested simplicity the model with the second lowest accuracy of 54.1 percent and 15 terms is recommended as matching the user requirements best.

Figure 3. After models have been built, SPIDA re-evaluates their suitability based on the outcome of the learning process. For some algorithms the SPIDA fuzzy knowledge base contains heuristics for re-running the learning process with different parameters in order to find simpler or more accurate solutions. The solution that best matches the requirements is automatically selected.
Figure 4. For each created model SPIDA provides a detailed report. Here the confusion matrix over the
test data set and the fuzzy rules of a NEFCLASS model are shown.
Figure 5. For each model SPIDA automatically creates and executes a workspace containing graphs of
function blocks. Experts can set up and run workspaces manually if they wish.
Based on this information the user can re-
consider his requirements on model simplicity. By
returning to the dialog on balancing accuracy and
simplicity (Figure 1) and relaxing the simplicity
requirement in favour of accuracy, SPIDA re-
evaluates the model performances and changes
its recommendation accordingly (without building
the models again). SPIDA now would recom-
mend the neuro-fuzzy model with 63.11 percent
accuracy and 28 terms. The user can easily try
different preference combinations and, guided by
the results report (Figure 4), decide which model
to finally select for implementation.
Behind the scenes, SPIDA automatically cre-
ates workspaces for each selected data analysis
algorithm and executes them in parallel (Figure 5).
A workspace encapsulates a program in SPIDA's
graphical programming language that connects
function blocks into graphs representing the flow
of data. Expert users can directly configure and
run workspaces if they wish.
Workspace 2 in Figure 5 is the representation of the neuro-fuzzy classifier in our glass data example. We can see that the data is first passed through a selection filter that selects the data columns used in the process. The data is then encoded to turn it into a format the neuro-fuzzy classifier can use. In this example the method requires the column with the symbolic class information to be re-encoded by six numeric columns containing membership degrees (there are six classes). The data then is split into training and test sets and fed into the neuro-fuzzy method. The neuro-fuzzy block outputs the incoming data and adds six prediction columns containing fuzzy membership degrees for each pattern and class. A classification filter defuzzifies the classification columns and decides what the class prediction is for each pattern. Two decode filters combine the classification columns into the original symbolic class labels. One filter decodes the columns containing the information from the training and test sets and one filter decodes the predictions. A merge filter combines the columns with the actual class information with the predicted class information and the final classification visualiser block computes the confusion matrices for training and test data and the overall performance of the model.
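Conceptually, such a workspace is simply a chain of function blocks through which the data flows. The following toy sketch, with invented block names rather than SPIDA's graphical language, illustrates the idea:

    # Toy rendering of a workspace as a chain of function blocks; block names are invented.
    def select_columns(rows):
        return [row[:3] for row in rows]                      # keep the first three columns

    def encode(rows):
        return [[float(value) for value in row] for row in rows]

    def split(rows):
        return {"train": rows[::2], "test": rows[1::2]}       # alternate rows into two sets

    workspace = [select_columns, encode, split]               # the "graph" is a simple chain here

    data = [[1, 2, 3, "a"], [4, 5, 6, "b"], [7, 8, 9, "c"]]
    result = data
    for block in workspace:
        result = block(result)
    print(result)   # {'train': [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]], 'test': [[4.0, 5.0, 6.0]]}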
The workspaces and the parameters in all blocks are created, configured and executed by SPIDA fully automatically based on the information encoded in SPIDA's fuzzy knowledge base. It is possible to reconfigure the knowledge base to build SPIDA versions for different application areas. In financial services, for example, simplicity may not be a concept of interest, but instead return on investment or profit. In some technical areas ROC (receiver/operator curve) or predictive gain may be preferred over accuracy. Any type of user preference and any features that can be attributed to a method or a model can be encoded in the knowledge base. As long as suitable fuzzy rules have been added to the knowledge base matching preferences to features SPIDA can use them to control the IDA process.
In the following section we explain how SPIDA matches user requirements and method properties, that is, the approach used by the fuzzy knowledge base to decide which analysis methods to execute.
FUZZY MATCHING OF
REQUIREMENTS AND PROPERTIES
SPIDA uses fuzzy techniques to describe user
requirements and the features and desired proper-
ties of data analysis methods and created models.
In addition, mappings between requirements and
properties are modelled as fuzzy rules. A closer
look at the requirements and properties reveals
that some of them carry symbolic (categorical)
values, whereas others are of numeric nature.
Examples for the first category are the type of
analysis problem (classification, function ap-
proximation, clustering, etc.), while accuracy and
simplicity represent the second category. Obvi-
ously, fuzzy sets can be defined on numeric as
ously, fuzzy sets can be defned on numeric as
well as symbolic (categorical) domains. However,
a more homogeneous approach has been presented
by Spott (2001). The main idea is to represent in-
formation by two levels of abstraction. The lower
level is the level of details like the potentially infinite number of possible values in a continuous numeric domain of discourse. At the higher level of abstraction we only deal with a finite number of symbols by abstracting from details. In other words, the granularity of information can be either fine or coarse (potentially, entire hierarchies of granularity are conceivable). If we measure accuracy of a prediction model we can do so at the level of details as a value in [0,1], for example, or at the symbolic level in terms of symbols like "high accuracy", "medium accuracy", or "low accuracy". What this approach allows us to do is to represent inherently symbolic information like the type of analysis problem mentioned above at the same level as symbols like "high accuracy" which in fact are abstractions from a level of finer granularity.
The advantage of this approach compared to
simply defning fuzzy sets on symbolic or numeric
domains becomes more obvious when we think
of expressing information using a combination
of symbols like "I fully accept a model of high accuracy, but to a lower degree I would accept a model of medium accuracy, as well." This expression could be quantified in a requirement as high accuracy (1.0) + medium accuracy (0.6). We call such an expression a weighted combination of symbols or, in short, a combination. Spott (2005)
has shown how to process such expressions at the
symbolic level. Therefore, it does not matter, if
the terms used are inherently symbolic or if they
are actually fuzzy sets themselves. In this way, all
symbols can be treated in a coherent way.
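In code, a weighted combination can be represented simply as a mapping from fuzzy words to weights. The following small sketch is our own illustration, not SPIDA's internal representation:

    # A weighted combination of fuzzy words represented as a plain dictionary
    # (our own illustration of the idea, not SPIDA's internal data structure).
    requirement = {"low": 0.0, "medium": 0.6, "high": 1.0}    # "high (1.0) + medium (0.6)"
    model_property = {"low": 0.3, "medium": 0.7, "high": 0.0}

    # The same representation works whether the words are inherently symbolic or
    # abstractions of fuzzy sets defined on a numeric domain.
    print(sum(requirement.values()), sum(model_property.values()))   # 1.6 1.0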
Spott (2001, 2005) proposed a probabilistic
model for the weights in combinations. That means
in particular that the sum of the weights is 1. For
SPIDA we extended this model by allowing a sum
of weights larger than 1 (Spott & Nauck, 2005).
Obviously, such weights can not be interpreted as
probabilities; rather, they represent the existence
of alternatives.
We assume that the original user requirements
have already been mapped onto desired properties
and that information about the respective method
and model properties is available. Each property
is represented by a fuzzy variable that assumes
combinations of fuzzy words as values. The desired accuracy, for example, could be medium (1.0) + high (1.0) whereas a created analysis model might be accurate to the degree of low (0.3) + medium (0.7). In other words, we are looking for a model with medium or high accuracy, and the created model's accuracy is low with degree 0.3 and medium with degree 0.7. In this example, the weights for the desired accuracy sum up to two, whereas the ones for the actual accuracy add up to one. We interpret a combination of fuzzy words with sum of weights greater than one as alternative fuzzy words. Rather than modelling the weights as probabilities, we assume a possibility density function (Gebhardt, 1993; Spott, 2001) on the fuzzy words, which allows for alternative values which, as an extreme case, could all be possible without restriction. This way, we define the degree of possibility of a fuzzy word in conjunction with probabilities of the related method or model property. The degree of possibility stands for the maximum acceptable probability of a property. Degrees of possibility can be any real number in [0,1].

In the example above, we were entirely happy with an analysis model of medium or high accuracy. We therefore assigned the possibilistic weight 1 to both of them, that is, models exhibiting the property low (0.0) + medium (a) + high (b) with a + b = 1 are fully acceptable. In case of the requirement low (0.0) + medium (0.0) + high (1.0) and the above property with a > 0 the weight a exceeds the possibility for medium and therefore at least partially violates the requirements.
Building on these ideas we stipulate that
requirements are represented as a possibilistic
combination of fuzzy words, that is, at least one
of the fuzzy words carries the weight one, so the
sum of weights of the fuzzy words is at least one.
This is based on the assumption that at least one
of the alternative requirements is fully acceptable.
Properties on the other hand stand for existing
properties of a given model, so we assume that
they are represented by a combination of fuzzy
words, where the sum of weights of fuzzy words
equals one.
The following part of the section deals with
the question of how a match of requirements and
properties can be quantified, that is, we are looking
to measure the compatibility of properties with
requirements. We will first focus on the compat-
ibility of a single pair of requirement/property and
then consider ways of combining several degrees
of compatibility in one value.
In order to find an appropriate compatibility measure we stipulate a number of required properties. For reasons of simplicity, we refer to combinations of fuzzy words simply as fuzzy sets (on fuzzy words) $\tilde{R}$ for requirements and $\tilde{P}$ for properties, defined on a finite universe of discourse X. Subsethood of fuzzy sets is defined in the traditional way (Zadeh, 1965) by $\tilde{A} \subseteq \tilde{B} :\Leftrightarrow \forall x \in X: \mu_{\tilde{A}}(x) \le \mu_{\tilde{B}}(x)$, with $\mu_{\tilde{A}}$ being the membership function of $\tilde{A}$. For a compatibility measure $C(\tilde{P}, \tilde{R})$ we require:

C1 $C(\tilde{P}, \tilde{R}) \in [0,1]$.

C2 If the fuzzy sets are disjoint, compatibility is 0: $\tilde{P} \cap \tilde{R} = \emptyset \Rightarrow C(\tilde{P}, \tilde{R}) = 0$.

C3 If the requirement fuzzy set is a superset of the property fuzzy set, compatibility is 1: $\tilde{R} \supseteq \tilde{P} \Rightarrow C(\tilde{P}, \tilde{R}) = 1$.

C4 Monotony in both arguments:
a) $\tilde{P}' \subseteq \tilde{P} \Rightarrow C(\tilde{P}', \tilde{R}) \ge C(\tilde{P}, \tilde{R})$,
b) $\tilde{R} \subseteq \tilde{R}' \Rightarrow C(\tilde{P}, \tilde{R}) \le C(\tilde{P}, \tilde{R}')$.
C1 is simply a normalisation, whereby a match
value of 1 means full match and a value of 0 stands
for no match. If the fuzzy sets are disjoint the
property does not meet any requirement and the
match is therefore 0 (C2). In C3 we make sure that
the degree of match is 1 as long as the requirement
covers the property, that is, the property is fully
acceptable. Finally, monotony in C4 means that the more relaxed the spectrum of requirements or the more specific the properties are, the higher the degree of compatibility.
Properties C1 - C4 resemble typical properties
of measures for fuzzy set inclusion. Normally,
such measures are generalisations of the inclu-
sion of crisp sets based on a number of axioms
(Cornelis, Van der Donck, & Kerre, 2003; Sinha &
Dougherty, 1993). However, many of these axioms
are too strict for our application. Also, the inclusion measure $Inc(\tilde{A}, \tilde{B}) = \min_{x \in X} I(\mu_{\tilde{A}}(x), \mu_{\tilde{B}}(x))$ with fuzzy implication operator I proposed in the above publications suffers from the serious drawback that it draws its value from a minimum operation, that is, for quite different fuzzy sets $\tilde{A}$, for instance, Inc would return the same degree of inclusion. In mathematical terms, Inc is not strictly monotonous. Furthermore, it does not take into account the relation in size of the subset of $\tilde{A}$ that is included in $\tilde{B}$ and the one that is not included.
We therefore propose a different measure which meets all our requirements. As already mentioned above, we assume $\max_{x \in X} \mu_{\tilde{R}}(x) = 1$ and $\sum_{x \in X} \mu_{\tilde{P}}(x) = 1$. The measure is defined as

$$C(\tilde{P}, \tilde{R}) := 1 - \frac{1}{2}\left(\sum_{x \in X}\left|\mu_{\tilde{P}}(x) - \mu_{\tilde{R}}(x)\right| - \sum_{x \in X}\mu_{\tilde{R}}(x) + 1\right)$$

which can be rewritten as

$$C(\tilde{P}, \tilde{R}) = 1 - \sum_{x \in X:\ \mu_{\tilde{P}}(x) > \mu_{\tilde{R}}(x)}\left(\mu_{\tilde{P}}(x) - \mu_{\tilde{R}}(x)\right). \quad (1)$$
Before we prove that the measure C fulfils requirements C1 - C4 let us explain what the measure actually does. Figure 6 shows a fuzzy set $\tilde{R}$ for requirements on the left hand side (we used a continuous domain for better illustration); the right triangular function is the fuzzy set $\tilde{P}$ for properties. The right term in (1) measures the size of the grey area, which can be interpreted as a degree to which the properties violate the requirements. The size of the area is bounded by
the area underneath the membership function of $\tilde{P}$ which we stipulated to be 1. That means that the right term measures the proportion of properties $\tilde{P}$ that violate the requirements. Determining the grey area is related to determining the size of the fuzzy difference $\tilde{P} - \tilde{R}$, but the operator $t\left(1 - \mu_{\tilde{R}}(x), \mu_{\tilde{P}}(x)\right)$ for the fuzzy difference does not match our semantics for properties and requirements.
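Equation (1) is easy to compute once requirements and properties are given as weight vectors over the same fuzzy words. The following few lines are our own illustrative implementation of the measure, not code taken from SPIDA:

    # Compatibility of a property with a requirement, following equation (1):
    # C(P, R) = 1 - sum over all fuzzy words x of max(0, mu_P(x) - mu_R(x)).
    def compatibility(prop, req):
        return 1.0 - sum(max(0.0, p - req.get(word, 0.0)) for word, p in prop.items())

    req = {"low": 0.0, "medium": 1.0, "high": 1.0}    # medium or high accuracy is acceptable
    prop = {"low": 0.3, "medium": 0.7, "high": 0.0}   # the model is low (0.3) + medium (0.7)

    print(compatibility(prop, req))   # 0.7: the weight 0.3 on "low" violates the requirement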
From (1) immediately follows:

Lemma 1: For the fuzzy set $\tilde{R}'$ with $\mu_{\tilde{R}'}(x) = \min\{\mu_{\tilde{R}}(x), \mu_{\tilde{P}}(x)\}$ holds $C(\tilde{P}, \tilde{R}') = C(\tilde{P}, \tilde{R})$.

The Lemma shows that the requirements $\tilde{R}$ can be stripped down to the intersection $\tilde{R}'$ of requirements and properties $\tilde{P}$ (using min as t-norm) without changing the compatibility of properties with requirements, see Figure 6.
Lemma 2: The measure C defined above conforms to properties C1 - C4.

Proof:

1. Since the right term in (1) is obviously not negative and the following holds

$$\sum_{x \in X:\ \mu_{\tilde{P}}(x) > \mu_{\tilde{R}}(x)}\left(\mu_{\tilde{P}}(x) - \mu_{\tilde{R}}(x)\right) \ \overset{(*)}{\le}\ \sum_{x \in X}\mu_{\tilde{P}}(x) = 1, \quad (4)$$

C meets requirement C1.

2. In case of $\tilde{P} \cap \tilde{R} = \emptyset$, we have equality at (*) in (4) and therefore C2 is met.

3. If $\tilde{P} \subseteq \tilde{R}$, then the set $\{x \in X : \mu_{\tilde{P}}(x) > \mu_{\tilde{R}}(x)\}$ is empty, the right term in (1) is 0 and therefore $C(\tilde{P}, \tilde{R}) = 1$.

4. If $\tilde{R}$ grows or $\tilde{P}$ shrinks, the value of the right term in (1) cannot increase (equivalently, the size of the grey area in Figure 6 cannot grow), therefore the value of C does not decrease. □

Furthermore, it turns out that C is a measure of satisfiability as defined by Bouchon-Meunier, Rifqi, and Bothorel (1996), which is not surprising since their notion of satisfiability is very similar to our understanding of compatibility.
Since we deal with a set of requirements and
properties, we end up having one match value
for each requirement/property pair. Requiring a
set of properties can formally be interpreted as a
logical conjunction of the individual requirements.
Given the assumption that all requirements are
equally important, we therefore propose to use a
t-norm to aggregate the individual match values.
We decided to use multiplication, as it is a strictly
monotonous operator. Strict monotony basically
means that the overall match value decreases with
any of the individual match values. Other operators
like minimum do not have this property. In case
of minimum, the overall match obviously is the
minimum of the individual matches. That means
Figure 6. Fuzzy sets of requirements, properties $\tilde{P}$ and the intersection $\tilde{R}'$ (we used a continuous domain for better illustration). The grey area represents incompatibility of properties with requirements.
that all the match values apart from the minimum
can be increased without changing the overall
value. This is not the desired behaviour in our
case since many different sets of properties would
result in the same overall match value as long as
the minimal value is the same. So the proposed
measure for a multi-criteria match is

$$C\big((\tilde{P}_1,\ldots,\tilde{P}_m),(\tilde{R}_1,\ldots,\tilde{R}_m)\big) = \prod_{j=1}^{m} C(\tilde{P}_j, \tilde{R}_j) = \prod_{j=1}^{m}\left(1 - \sum_{x \in X_j:\ \mu_{\tilde{P}_j}(x) > \mu_{\tilde{R}_j}(x)}\left(\mu_{\tilde{P}_j}(x) - \mu_{\tilde{R}_j}(x)\right)\right).$$
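The product aggregation itself is a one-line loop; again, the sketch below is our own illustration rather than SPIDA code:

    # Aggregate several requirement/property match values with the product t-norm.
    def overall_match(match_values):
        result = 1.0
        for value in match_values:
            result *= value
        return result

    print(overall_match([0.7, 1.0, 0.9]))   # about 0.63: any weak individual match lowers the total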
SPIDA AS AN ANALYTICAL
SERVICE
A model that has been created by SPIDA only
really becomes useful if it can be integrated into
an operational process. For example, the customer
relationship management system of a business
may require a model to classify customers for
marketing purposes like cross-selling, a network
management system may require a model to
monitor traffic patterns and identify potential intrusions, or a financial services system may require a model for predicting tomorrow's stock exchange rates to support buy/sell decisions. In
order to support these types of operational sys-
tems, a model has to be taken out of the analytical
software environment and it has to be integrated
in order to continuously score new data.
Integration can be done loosely, by running models out of the analytical environment and applying them to database tables that are in turn used by operational systems. In the future, modern database systems will offer analytical services and store models in PMML, thus integrating the analytical environment and data management. However, sometimes this form of integration is not sufficient, because operational systems can be based on proprietary database systems without common interfaces, or data must be scored on the fly in real time without first going into a database table.
Integration is made more difficult by the fact
that from time to time, models have to be adjusted
in order to keep up with changes in the business
or the market. Such adaptations are usually done
manually by analysts, who produce new models
based on new data. Most of the time, the un-
derlying analysis algorithm stays the same, only
parameters change. For example, the weights of
a neural network or the rules in a fuzzy system
might be adapted.
In case of data analysis functionality in the
database and a PMML model, even algorithms
can be changed without compromising the integ-
rity of operational systems. For example, it might
turn out that a decision tree performs better on a
new data set than a support vector machine that
has been used before. By simply exchanging the
PMML description of the model in the database,
the new model can be integrated seamlessly.
However, such database systems are not yet standard, and furthermore the decision for the new
algorithm and its integration would still require
manual interaction.
The SPIDA wizard can automate this process by selecting, configuring, and implementing new algorithms without user interaction. The final decision whether a new algorithm will be uploaded into the operational system will probably remain under human supervision, but the difficult part of creating it can mostly be achieved by SPIDA.
A schema summarising all scenarios in com-
bination with SPIDA is shown in Figure 7. Data
is fed into the operational system from a data
warehouse. The data analysis module conducts
data analysis for the operational system. This way it influences the system's performance or the performance of related business processes, because analysis results are usually used to make business decisions. For example, the data analysis module might be used to predict if a customer is going to churn (i.e., cancel his contract) and performance measures might be the relative number of churners not being flagged up by the system
and the resulting loss of revenue. In this example,
a drop in system performance will be the result
of an inferior analytics model. Such a drop in
performance will be detected by the monitor and
trigger the generation of a new analysis model.
The easiest way to detect dropping performance
is by comparing against predefined thresholds.
The reasons for deterioration can be manifold.
In most cases, the dynamics of the market are
responsible, that is, the market has changed in
a way that cannot be detected by the analytical
model. Other reasons include changes in related
business processes and data corruption.
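The threshold check at the heart of such a monitor is simple; the sketch below is a hypothetical illustration with invented names and values, not the actual monitoring component:

    # Hypothetical monitor: trigger a rebuild when a performance figure drops below
    # a predefined threshold (names and threshold are invented for illustration).
    def monitor(measured_accuracy, threshold=0.75):
        if measured_accuracy < threshold:
            return "trigger model rebuild"
        return "keep current model"

    print(monitor(0.81))   # keep current model
    print(monitor(0.62))   # trigger model rebuild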
Whenever model generation is triggered by the
monitor, the wizard takes the user requirements
and the latest data to build a new model. Another
reason for building a new model is a change of
user requirements. For example, due to a change
of company policy or for regulatory reasons, a
model may now be required to be comprehensible.
If, for instance, a neural network had been used to
measure the credit line of a potential borrower, we
would be forced to switch to methods like deci-
sion trees or fuzzy systems which create a rule
base that can be understood by managers and the
regulator. This may be required to demonstrate
that the model conforms to an equal opportunities
act, for example.
When it comes to integrating modules with
operational systems, software engineers can
nowadays choose from a variety of techniques
that both depend on the operational system and
the module. Where some years ago, modules were
quite often especially tailored for the operational
system, nowadays, much more modular approach-
es are used that allow for using the same modules
in different operational systems on the one hand,
and replacing modules in operational systems on
the other hand. Apart from the database approach
where data analysis can directly be conducted
by the database and analysis models are defned
using a standard like PMML, other techniques
use libraries with standard interfaces (API) and
provide a respective interface in the operational
system or use even less coupled techniques like
Web services. For SPIDA, we decided to build
executable Java libraries which can easily be
used in either way.
Since we require a variety of different data
analysis techniques to cover as many applications
as possible, the starting point for the generation
of analysis modules are data analysis platforms
like SPIDA. The actual analysis model required
by an operational system typically uses only
a tiny fraction of such a platform's functional-
ity. For reasons of saving space and potentially
licensing costs we want to restrict the module to
the functions actually needed by the underlying
data analysis model or process.
In case of SPIDA, we are looking to create
analytical modules, which are pieces of software
independent from the SPIDA platform, cor-
responding to a particular data analysis model.
Figure 7. Monitoring of system performance and requirements and automatic creation and integration
of data analysis modules according to given requirements
Once extracted, the independent software module
can be used as a library with an exposed API by
the operational system. Changes in the analytical
model are completely hidden from the operational
system by the module, that is, the module can
easily be replaced if required due to changes in
requirements or a drop in performance.
At the heart of the integration of an analytic
module in SPIDA is a general framework with
which an operational system can communicate
and specify its requirements and obtain an analytic
module from SPIDA as a Java library.
First, the definition of the analysis problem and the user requirements are specified in XML. The SPIDA wizard parses the XML description and identifies the corresponding analysis components like data access, filters, and the actual analysis blocks. A mapping is made between the required components and the corresponding Java classes of the platform. SPIDA's software extractor component then finds all the dependent classes for this initial set of classes. The extractor uses application knowledge and a host of other extraction techniques to find all the additional classes required to make it executable and independent from the platform. This complete set of all component classes is turned into a Jar library with suitable properties by SPIDA's Jar creator. This Jar library is passed to the operational system, which can access the API provided in the library to control the execution of the analysis module. It is also possible to instruct SPIDA to deliver the Jar library in form of a stand-alone package that is able to directly execute the included model for specific data analysis tasks.
If we think of data analysis services, we can expect many different concurrent analysis requests. Since each analysis process will create an executable object, the size of dedicated Jar libraries should be kept as small as possible in order to allow for as many users as possible. Table 1 shows the compression rates for libraries created for different data mining algorithms compared to the whole SPIDA package, which range between 17 percent and 48 percent.

Table 1. Compression rates of dedicated Java libraries for various data analysis algorithms compared to the whole SPIDA package

Model                      Relative Size
Linear Regression          33%
NEFPROX                    48%
Decision Trees             17%
Support Vector Machines    17%
Neural Networks            21%
NEFCLASS                   47%
CASE STUDIES
The SPIDA platform discussed in this chapter is
a research demonstrator, and its main purpose
was to build and test methods for automatically
generating predictive models in a data analysis
process. Although the platform is constantly used
in our team for doing day-to-day data analysis ap-
plying it to operational processes requires further
work. In this section we briefy mention three suc-
cessful internal applications that build on SPIDA
technology, that is, methods for automatically
building models from data. Instead of making
the whole platform available in an operational
environment we decided to build smaller self-
contained applications targeting three business
areas: travel management for mobile workforces,
customer satisfaction analysis, and customer life
cycle modelling.
BT has a large mobile workforce for main-
taining its network, providing customers with
products and services and repairing faults. A dy-
namic scheduler is used to plan the working day
of each engineer and assign the right job to the
right person. In order to provide good schedules it
is important to have good estimates for job dura-
tions and inter-job travel times. We have built an
intelligent travel time estimation system (ITEMS)
that estimates inter-job times based on historic
evidence (Azvine, Ho, Kay, Nauck, & Spott,
2003). An important part of ITEMS is a graphi-
cal interface where resource managers can study
patterns that lead to engineers arriving late for
jobs. The purpose of this system is to understand
where process issues lead to delays and how the
process can be improved. SPIDA technology is
used to learn patterns from data.
In order to be useful the rules must be simple and sparse and they must be created automatically in real-time for a specific set of journeys for which the user requires explanations (Nauck, 2003). We had two algorithms available from SPIDA that were suitable for this purpose: decision trees and a neuro-fuzzy classifier (NEFCLASS). These two algorithms and the required SPIDA configurations for creating small interpretable rule bases were implemented into ITEMS. Figure 8 displays a screenshot of a typical situation while using ITEMS. A user analyzes the travel pattern of some technician and has clicked an arrow in the displayed map. Additional windows display detailed information about the corresponding job. The top right window displays two fuzzy rules matching the travel information of the selected job. The user can see the degree of fulfilment of a rule and decide if a particular rule is useful for explaining the selected travel pattern, that is, why the technician was late, early, or on time.
Another application based on SPIDA tech-
nology has been built in the area of customer
satisfaction analysis. Most service-oriented companies regularly survey their customers to learn about their satisfaction with the company's customer services. Variables in a survey typically display high degrees of interdependence, which is important to take into account when modelling survey results. In our application we also had to provide abilities for conducting what-if analyses and target setting, that is, planning how to achieve a certain level of customer satisfaction. We identified Bayesian networks as the ideal model for this scenario and used SPIDA technology to implement an application, iCSat, that fully automatically learns a Bayesian network from data (Nauck,
Ruta, Spott, & Azvine, 2006). It allows the user
to identify drivers of customer satisfaction, to
run what-if scenarios for understanding how
satisfaction levels might change when certain
drivers change, and to simulate desired levels of
customer satisfaction for identifying targets for
drivers of satisfaction.
The final application based on SPIDA tech-
nology is used to model customer life cycles.
We have built an intelligent customer analytics
platform that can automatically learn hidden Mar-
kov models from customer event data (Azvine,
Nauck, Ho, Broszat, & Lim, 2006). The models
can be used to predict probabilities and timings
of relevant future customer events like churn,
complaints, purchases, and so on. The predictions
are used to control pro-active customer services,
like contacting a customer in time before a service
problem escalates and could lead to complaints
or churn.
Figure 8. The ITEMS system with automatically derived rules for explaining travel patterns

CONCLUSION

As a new direction in automating data analysis, we introduced the concept of using soft constraints for the selection of an appropriate data analysis method. These constraints represent the user's requirements regarding the analysis problem (like prediction, clustering, or finding dependencies) and preferences regarding the solution.
Requirements can potentially be defined at any
level of abstraction. Expert knowledge in terms of
a fuzzy rule base maps high-level requirements
onto required properties of data analysis methods
which will then be matched to actual properties
of analysis methods.
As a result of our work, we introduced a new
measure for the compatibility of fuzzy require-
ments with fuzzy properties that can be applied
to other problems in the area of multi-criteria
decision making. The methods presented above
have been implemented as a wizard for our data
analysis tool SPIDA, which has been successfully
used to produce solutions to a variety of problems
within BT.
SPIDA can serve as a smart data analysis
service for operational systems. Where current
systems require manual intervention by data
analysis experts if a data analysis module needs
updating, our architecture automatically detects
changes in performance or requirements, builds
a new data analysis model, and wraps the model
into an executable Java library with standard API
that can easily be accessed by operational systems.
When building data analysis models, the wizard
takes into account high-level user requirements
regarding an explanation facility, simplicity of an
explanation, adaptability, accuracy, and so on. In
this way, we can vary the data analysis solution
as much as possible without violating require-
ments. Before, adaptive solutions only retrained
models based on new data, for example, an analy-
sis method like a neural network was selected
once and only adapted to new data. Switching
automatically, without user interaction, to
another method, for example, to a support vector
machine was not possible, let alone considering
user requirements for the automatic selection of
the most suitable analysis method.
REFERENCES
Azvine, B., Ho, C., Kay, S., Nauck, D., & Spott, M.
(2003). Estimating travel times of field engineers.
BT Technology Journal, 21(4), 33-38.
Azvine, B., Nauck, D.D., Ho, C., Broszat, K., &
Lim, J. (2006). Intelligent process analytics for
CRM. BT Technology Journal, 24(1), 60-69.
Bouchon-Meunier, B., Rifqi, M., & Bothorel,
S. (1996). Towards general measures of com-
parison of objects. Fuzzy Sets & Systems, 84(2),
143-153.
Bernstein, A., Hill, S., & Provost, F. (2002). In-
telligent assistance for the data mining process:
An ontology-based approach (CeDER Working
Paper IS-02-02). New York: Center for Digital
Economy Research, Leonard Stern School of
Business, New York University.
Bernstein, A. & Provost, F. (2001). An intelligent
assistant for the knowledge discovery process
(CeDER Working Paper IS-01-01). New York:
Center for Digital Economy Research, Leonard
Stern School of Business, New York Univer-
sity.
Botia, J.A., Velasco, J.R., Garijo, M., & Skarmeta,
A.F.G. (1998). A generic datamining system.
Basic design and implementation guidelines. In
H. Kargupta & P.K. Chan (Eds.), Workshop in
distributed datamining at KDD-98. New York:
AAAI Press.
Botia, J.A., Skarmeta, A.F., Velasco, J.R., &
Garijo, M. (2000). A proposal for Meta-Learning
through a MAS. In T. Wagner & O. Rana (Eds.),
Infrastructure for agents, multi-agent systems,
and scalable multi-agent systems (pp. 226-233).
Lecture notes in artificial intelligence 1887. Berlin:
Springer-Verlag.
Cornelis, C., Van der Donck, C., & Kerre, E.
(2003). Sinha-Dougherty approach to the fuzzi-
fication of set inclusion revisited. Fuzzy Sets & Systems, 134, 283-295.
Gebhardt, J., & Kruse, R. (1993). The context model: An integrating view of vagueness and uncertainty. International Journal of Approximate Reasoning, 9, 283-314.
Nauck, D.D. (2003a). Fuzzy data analysis with
NEFCLASS. International Journal of Approxi-
mate Reasoning, 32, 103-130.
Nauck, D.D. (2003b). Measuring interpretability
in rule-based classification systems. Proceedings
of the 2003 IEEE International Conference on
Fuzzy Systems (FuzzIEEE2003), 1, pp. 196-201.
Saint Louis, MO: IEEE Press.
Nauck, D.D., Ruta, D., Spott, M., & Azvine, B.
(2006). A tool for intelligent customer analytics.
Proceedings of the IEEE International Confer-
ence on Intelligent Systems, pp. 518-521. London:
IEEE Press.
Nauck, D.D., Spott, M., & Azvine, B. (2003).
SPIDA - A novel data analysis tool. BT Technology Journal, 21(4), 104-112.
Sinha, D. & Dougherty, E. (1993). Fuzzification of set inclusion: Theory and applications. Fuzzy Sets & Systems, 55, 15-42.
Spott, M. (2001). Combining fuzzy words.
Proceedings of IEEE International Conference
on Fuzzy Systems 2001. Melbourne, Australia:
IEEE Press.
Spott, M. (2005). Efficient reasoning with fuzzy words. In S.K. Halgamuge & L.P. Wang (Eds.), Computational intelligence for modelling and predictions, pp. 117-128. Studies in Computational Intelligence 2. Berlin: Springer Verlag.
Spott, M. & Nauck, D.D. (2005). On choosing an
appropriate data analysis algorithm. Proceedings
of the IEEE International Conference on Fuzzy
Systems 2005, Reno, NV: IEEE Press.
University of California at Irvine (2007). UCI
Machine Learning Repository. Accessed March 26, 2007, at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Wirth, R., Shearer, C., Grimmer, U., Reinartz,
J., Schloesser, T.P., Breitner, C., Engels, R., &
Lindner, G. (1997). Towards process-oriented tool
support for knowledge discovery in databases.
Principles of Data Mining and Knowledge Dis-
covery. First European Symposium, PKDD 97,
pp. 243-253. Lecture Notes in Computer Science
1263, Berlin: Springer-Verlag.
Witten, I.H. & Frank, E. (2000). Data mining:
Practical machine learning tools and techniques
with JAVA implementations. San Francisco: Mor-
gan Kaufmann Publishers.
Zadeh, L. A. (1965). Fuzzy sets. Information and
Control, 8, 338-353.
Chapter II
Random Fuzzy Sets
Hung T. Nguyen
New Mexico State University, USA
Vladik Kreinovich
University of Texas at El Paso, USA
Gang Xiang
University of Texas at El Paso, USA
ABSTRACT
It is well known that in decision making under uncertainty, while we are guided by a general (and ab-
stract) theory of probability and of statistical inference, each specifc type of observed data requires its
own analysis. Thus, while textbook techniques treat precisely observed data in multivariate analysis,
there are many open research problems when data are censored (e.g., in medical or bio-statistics),
missing, or partially observed (e.g., in bioinformatics). Data can be imprecise due to various reasons,
for example, due to fuzziness of linguistic data. Imprecise observed data are usually called coarse data.
In this chapter, we consider coarse data which are both random and fuzzy. Fuzziness is a form of im-
precision often encountered in perception-based information. In order to develop statistical reference
procedures based on such data, we need to model random fuzzy data as bona fde random elements,
that is, we need to place random fuzzy data completely within the rigorous theory of probability. This
chapter presents the most general framework for random fuzzy data, namely the framework of random
fuzzy sets. We also describe several applications of this framework.
FROM MULTIVARIATE STATISTICAL
ANALYSIS TO RANDOM SETS
What is a random set? An intuitive mean-
ing. What is a random set? Crudely speaking,
a random number means that we have different
numbers with different probabilities; a random
vector means that we have different vectors with
different probabilities; and similarly, a random set
means that we have different sets with different
probabilities.
How can we describe this intuitive idea in
precise terms? To provide such a formalization,
let us recall how probabilities and random vectors
are usually defned.
How probabilities are usually defined. To describe probabilities, in general, we must have a set of possible situations $\Omega$, and we must be able to describe the probability P of different properties of such situations. In mathematical terms, a property can be characterized by the set of all the situations which satisfy this property. Thus, we must assign to sets $A \subseteq \Omega$ the probability value P(A).

According to the intuitive meaning of probability (e.g., as frequency), if we have two disjoint sets A and A', then we must have $P(A \cup A') = P(A) + P(A')$. Similarly, if we have countably many mutually disjoint sets $A_i$, we must have:

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

A mapping which satisfies this property is called $\sigma$-additive.
It is known that even in the simplest situations, for example, when we randomly select a number from the interval $\Omega = [0,1]$, it is not possible to have a $\sigma$-additive function P which would be defined on all subsets of [0,1]. Thus, we must restrict ourselves to a class $\mathcal{A}$ of subsets of $\Omega$. Since subsets represent properties, a restriction on subsets means a restriction on properties. If we allow two properties F and F', then we should also be able to consider their logical combinations F & F', $F \vee F'$, and $\neg F$, which in set terms correspond to intersection, union, and complement. Similarly, if we have a sequence of properties $F_n$, then we should also allow properties $\bigvee_n F_n$ and $\bigwedge_n F_n$ which correspond to countable union and intersection. Thus, the desired family $\mathcal{A}$ should be closed under (countable) union, (countable) intersection, and complement. Such a family is called a $\sigma$-algebra.

Thus, we arrive at a standard definition of a probability space as a triple $(\Omega, \mathcal{A}, P)$, where $\Omega$ is a set, $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\Omega$, and $P: \mathcal{A} \to [0,1]$ is a $\sigma$-additive function.
How random objects are usually defined. Once a probability space is fixed, a random object from the set C is defined as a mapping $V: \Omega \to C$. For example, the set of all d-dimensional vectors is the Euclidean space $\mathbb{R}^d$, so a random vector is defined as a map $V: \Omega \to \mathbb{R}^d$.

The map V must enable us to define probabilities of different properties of objects. For that, we need to fix a class of properties; we already know that such a class $\mathcal{B}$ should be a $\sigma$-algebra. For each property of objects, that is, for each set $B \in \mathcal{B}$, it is natural to define the probability P(B) as the probability that for a random situation $\omega$, the corresponding object $V(\omega)$ belongs to the set B (i.e., satisfies the desired property). In precise terms, this means that we define this probability as $P(V^{-1}(B))$, where $V^{-1}(B)$ denotes $\{\omega : V(\omega) \in B\}$.

For this definition to be applicable, we must require that for every set B from the desired $\sigma$-algebra $\mathcal{B}$, the set $V^{-1}(B)$ must belong to $\mathcal{A}$. Such mappings are called $\mathcal{A}$-$\mathcal{B}$-measurable.
How random vectors are usually defined. In particular, for random vectors, it is reasonable to allow properties corresponding to all open sets $B \subseteq \mathbb{R}^d$ (like $x_1 > 0$) and properties corresponding to all closed sets B (such as $x_1 \ge 0$). So, we must consider the smallest $\sigma$-algebra which contains all closed and open sets. Sets from this smallest subalgebra are called Borel sets; the class of all Borel sets over $\mathbb{R}^d$ is denoted by $\mathcal{B}(\mathbb{R}^d)$. So, a random vector is usually defined as an $\mathcal{A}$-$\mathcal{B}^d$-measurable mapping $V: \Omega \to \mathbb{R}^d$.

Alternatively, we can consider the vectors themselves as events, that is, consider a probability space $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d), P_V)$. In this reformulation, as we have mentioned, for every set B, we get $P_V(B) = P(V^{-1}(B))$, so we can say that $P_V$ is a composition of the two mappings: $P_V = P \circ V^{-1}$. This measure $P_V$ is called a probability law of the random vector.
How random vectors are usually described. From the purely mathematical viewpoint, this is a perfect definition of a random vector. However, from the computational viewpoint, the need to describe $P_V(B)$ for all Borel sets B makes this description impractical. Good news is that due to $\sigma$-additivity, we do not need to consider all possible Borel sets B; it is sufficient to describe $P_V(B)$ only for the sets of the type $(-\infty, x_1] \times \ldots \times (-\infty, x_d]$, that is, it is sufficient to consider only the probabilities that $V_1 \le x_1\ \&\ \ldots\ \&\ V_d \le x_d$.

In the one-dimensional case, the probability F(x) that $V \le x$ is called a (cumulative) distribution function. Similarly, in the general d-dimensional case, the probability $F(x_1, \ldots, x_d)$ that $V_1 \le x_1$, ..., and $V_d \le x_d$ is called a distribution function. In these terms, the probability measures on $\mathcal{B}(\mathbb{R}^d)$ are uniquely characterized by their distribution functions. This result was first proved by Lebesgue and Stieltjes and is thus called the Lebesgue-Stieltjes theorem.
From random vectors to slightly more general random objects. The above definition of a random vector does not use any specific features of a multi-dimensional vector space $\mathbb{R}^d$, so it can be naturally generalized to the case when the set C of objects (possible outcomes) is a locally compact Hausdorff second countable topological space. Such spaces will be described as LCHS, for short.

From random vectors to random sets. As we have mentioned, in some real-life situations, the outcome is not a vector but a set of possible vectors. In this case, possible outcomes are subsets of the vector space $\mathbb{R}^d$, so the set C of possible outcomes can be described as the set of all such subsets, that is, the power set $\mathcal{P}(\mathbb{R}^d)$.
Need to consider one-element sets. An important particular case of this general situation is the case when we know the exact vector x. In this case, the set of all possible vectors is the corresponding one-element set {x}. So, one-element sets must appear as possible sets.

Restriction to closed sets. From the practical viewpoint, it is sufficient to only consider closed sets.
Indeed, by definition, a closed set is a set which contains all the limits of its elements. If $x_n \in S$ and $x_n \to x$, then, by definition of a limit, this means that whatever accuracy we choose, we cannot distinguish between x and the values $x_n$ for sufficiently large n. Since $x_n \in S$ and x is indistinguishable from $x_n$, it makes sense to conclude that x also belongs to S, that is, that S is indeed a closed set.

Since one-element sets are closed, this restriction is in accordance with the above-mentioned need to consider one-element sets. (In a more general framework of a LCHS space, the requirement that all one-point sets be closed is one of the reasons why we impose the restriction that the topology must be Hausdorff: for Hausdorff topological spaces, this requirement is satisfied.)

With this restriction, the set C of possible outcomes is the class of all closed subsets of the space $\mathbb{R}^d$; this class is usually denoted by $\mathcal{F}(\mathbb{R}^d)$, or simply $\mathcal{F}$, for short.
It is worth mentioning that the decision to restrict ourselves to closed sets was made already in the pioneering book (Matheron, 1975) on random sets.

We need a topology on $\mathcal{F}$. To finalize our definition of closed random sets, we must specify a $\sigma$-field on $\mathcal{F}$. To specify such a field, we will follow the same idea as with random vectors, namely, we will define a topology on $\mathcal{F}$ and then consider the $\sigma$-field of all the Borel sets in this topology (i.e., the smallest $\sigma$-field that contains all sets which are open or closed in the sense of this topology).

In his monograph (Matheron, 1975), Matheron described a natural topology that he called a hit-or-miss topology. (It is worth mentioning that this topology was first introduced in a completely different problem: the construction of the regularized dual space of a C*-algebra [Fell, 1961, 1962].) Matheron's motivations and definitions work well for random sets. However, they are difficult to directly generalize to random fuzzy sets. Since the main objective of this paper is to describe and analyze random fuzzy sets, we will present a different derivation of this topology, a derivation that can be naturally generalized to random fuzzy sets.
This alternative definition is based on the fact that on the class of closed sets, there is a natural order $A \supseteq B$. The meaning of this order is straightforward. If we have a (closed) set A of possible vectors, this means that we only have partial information about the vector. When we gain additional information, this enables us to reduce the original set A to its subset B. Thus, $A \supseteq B$ means that the set B carries more information about the (unknown) vector than the set A.

It turns out that in many important situations, this order enables us to describe a natural topology on the corresponding ordered set. Specifically, this topology exists when the corresponding order forms a so-called continuous lattice. To describe this topology in precise terms, we thus will need to recall the basic notions of lattice theory and the definition of a continuous lattice.
Basics of lattice theory. Lattices are a particular case of partially ordered sets (also known as posets); so, to define the lattice, we must first recall the definition of a poset. A poset is defined as a set X together with a relation $\le$ which satisfies the following properties:

(i) $\le$ is reflexive, that is, $x \le x$ for all $x \in X$;
(ii) $\le$ is anti-symmetric, that is, for all $x, y \in X$, $x \le y$ and $y \le x$ imply that $x = y$;
(iii) $\le$ is transitive, that is, for all $x, y, z \in X$, if $x \le y$ and $y \le z$, then $x \le z$.

An element $u \in X$ is called an upper bound for a pair x, y if $x \le u$ and $y \le u$. Dually, an element $v \in X$ is called a lower bound for a pair x, y if $v \le x$ and $v \le y$. These concepts can be naturally extended to arbitrary collections of elements $A \subseteq X$: an element $u \in X$ is an upper bound for A if $x \le u$ for all $x \in A$; an element $v \in X$ is a lower bound for A if $v \le x$ for all $x \in A$.

The least upper bound is exactly what it sounds like: the least of all the upper bounds. Similarly, the greatest lower bound is the greatest of all the lower bounds. In precise terms, an element $u \in X$ is called the least upper bound of the set A if it satisfies the following two conditions:

1. u is an upper bound for the set A (i.e., $x \le u$ for all $x \in A$); and
2. if w is an upper bound for the set A, then $u \le w$.

The least upper bound of a set A is also called its join or supremum and is usually denoted by $\vee A$ or sup A.

Similarly, an element $v \in X$ is called the greatest lower bound of the set A if it satisfies the following two conditions:

1. v is a lower bound for the set A (i.e., $v \le x$ for all $x \in A$); and
2. if w is a lower bound for the set A, then $w \le v$.
The greatest lower bound of a set A is also called its meet or infimum and is usually denoted by $\wedge A$ or inf A.

For two-element sets, $\vee\{a, b\}$ is usually denoted by $a \vee b$ and $\wedge\{a, b\}$ by $a \wedge b$.

A poset X is called a lattice if $a \vee b$ and $a \wedge b$ exist for all pairs $a, b \in X$. A poset X is called a complete lattice if both $\vee A$ and $\wedge A$ exist for any set $A \subseteq X$.
Examples of lattices. The set $\mathbb{R}$ of all real numbers with a standard order is a lattice, with $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. This set is not a complete lattice, since it does not have a join $\vee \mathbb{R}$: no real number is larger than all the others.

A slight modification of this example makes it a complete lattice: namely, the set of all real numbers from an interval $[\underline{x}, \overline{x}]$ is a complete lattice.

The set of all n-dimensional vectors $a = (a_1, \ldots, a_n)$ with the component-wise order $a \le b \Leftrightarrow a_1 \le b_1\ \&\ \ldots\ \&\ a_n \le b_n$ is also a lattice, with $(a \vee b)_i = \max(a_i, b_i)$ and $(a \wedge b)_i = \min(a_i, b_i)$.

Another example of a lattice is the set of all the functions, with $(f \vee g)(x) = \max(f(x), g(x))$ and $(f \wedge g)(x) = \min(f(x), g(x))$.
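As a concrete worked instance of the component-wise operations (added here for illustration): for $a = (1, 5, 2)$ and $b = (3, 4, 2)$ we get $a \vee b = (3, 5, 2)$ and $a \wedge b = (1, 4, 2)$.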
Continuous lattices and Lawson topology. In some posets, in addition to the original relation $\le$ ("smaller"), we can define a new relation $\ll$ whose meaning is "much smaller". This relation is called "way below".

The formal definition of $\ll$ requires two auxiliary notions: of a directed set and of a dcpo. A set $A \subseteq X$ is called directed if every finite subset of A has an upper bound in A. Of course, in a complete lattice, every set is directed.

A poset X is called a directed complete partial order (dcpo), if each of its directed subsets D has a supremum $\vee D$. In particular, every complete lattice is a dcpo.

We say that x is way below y ($x \ll y$) if for every directed subset D for which $y \le \vee D$, there exists an element $d \in D$ such that $x \le d$.

In particular, a one-element set D = {y} is always directed, and for this set, we conclude that $x \le y$, that is, "way below" indeed implies $\le$. Another simple example: for the natural order on the set of real numbers, $x \ll y$ simply means $x < y$.
From the common sense viewpoint, we expect that if x is way below y and z is even smaller than x, then z should also be way below y. This is indeed true for the above definition: indeed, if $x \ll y$ and $z \le x$, then $z \ll y$: for every directed D with $y \le \vee D$, we have $x \le d$ for some d and hence $z \le x \le d$, that is, $z \le d$ for that same element $d \in D$. Similarly, we can prove all three statements from the following lemma.
Lemma 1. Let $(L, \le)$ be a dcpo. Then, for any u, x, y, z in L, we have:

(i) $x \ll y$ implies $x \le y$;
(ii) $u \le x \ll y \le z$ implies $u \ll z$;
(iii) if $x \ll z$ and $y \ll z$, then $x \vee y \ll z$.
For example, to prove (iii), for every directed set D, we must prove that $z \le \vee D$ implies that $x \vee y \le d$ for some $d \in D$. Indeed, from $x \ll z$ and $y \ll z$, we conclude that $x \le d_x$ and $y \le d_y$ for some $d_x, d_y \in D$. Since D is a directed set, there exists an element $d \in D$ which is an upper bound for both $d_x$ and $d_y$, that is, for which $d_x \le d$ and $d_y \le d$. From $x \le d_x$ and $d_x \le d$, we conclude that $x \le d$ and similarly, that $y \le d$, so d is an upper bound for x and y. By definition of the least upper bound $x \vee y$, it must be smaller than or equal to any other upper bound, hence $x \vee y \le d$. The statement is proven.
We can define a topology if we take, as a subbase, the sets {y ∈ X : x ≪ y} and {y ∈ X : x ≰ y} for all x ∈ X. In other words, as open sets for this topology, we take arbitrary unions of finite intersections of these sets. This topology is called a Lawson topology.
It is worth mentioning that for the standard order on real numbers, the sets {y ∈ X : x ≪ y} = {y : y > x} and {y ∈ X : x ≰ y} = {y : y < x} are indeed open, and the corresponding topology coincides with the standard topology on the real line.
There is an important reasonably general case when the Lawson topology has useful properties: the case of a continuous lattice. A complete lattice X is called a continuous lattice if every element x ∈ X is equal to the join of all the elements which are way below it, that is, if x = ∨{y ∈ X : y ≪ x} for all x ∈ X. It is known that on every continuous lattice (X, ≤), the Lawson topology is compact and Hausdorff; see, for example, Gierz, Hofmann, Keimel, Lawson, Mislove, and Scott (1980).
Final definition of a (closed) random set and the need for further analysis. In our analysis of random sets, we will use the Lawson topology to describe the σ-algebra of subsets of F as the class of all Borel sets in the sense of this topology.
From the purely mathematical viewpoint, this is a (somewhat abstract but) perfect definition. However, since our objective is to make this definition applicable to practical problems, we need to first reformulate this general abstract definition in more understandable, closer-to-practice terms.
THE LAWSON TOPOLOGY OF
CLOSED SETS
First try. Let us describe what these notions lead to in the case of closed sets. For the class F of closed sets, there is a natural ordering relation ≤: the set inclusion F ⊆ F'. The corresponding poset (F, ⊆) is indeed a complete lattice: indeed, for every family F_i (i ∈ I) of closed sets, there exist both the infimum and the supremum:
∧{F_i : i ∈ I} = ∩{F_i : i ∈ I};
∨{F_i : i ∈ I} = the closure of ∪{F_i : i ∈ I}.
The resulting complete lattice is not continuous. Let us show that with the inclusion ⊆ as the ordering relation ≤, F is not continuous. Specifically, as we will show, for example, in the one-dimensional case R^d = R, the definition F = ∨{G ∈ F : G ≪ F} of a continuous lattice is violated for F = R.
For that, we will prove that for every two sets F and G, G ≪ F implies that G = ∅. We will prove this by reduction to a contradiction. Let us assume that G ≠ ∅ and G ≪ F. Since the set G is not empty,
it contains an element x ∈ G. For this element x, the one-point set S = {x} is a closed subset of G: S ⊆ G. By our first-try definition of ≤, this means that S ≤ G. We have already mentioned that if S ≤ G and G ≪ F, then S ≪ F. By definition of the way below relation ≪, this means that if F ≤ ∨D, then there exists an element d ∈ D for which S ≤ d. In particular, we can take as D the family {d_n}_n, where, for every positive integer n, d_n := (−∞, x − 1/n] ∪ [x + 1/n, +∞). It is easy to see that d_{n_1} ∨ ... ∨ d_{n_k} = d_{max(n_1,...,n_k)}, so the family D is indeed directed. The union ∪_n d_n of these sets is equal to (−∞, x) ∪ (x, +∞). Thus, the closure ∨D of this union is the entire real line R. Since F is a subset of the real line, we have F ≤ ∨D. However, for every d_n ∈ D, we have x ∉ d_n and thus S ⊄ d_n. The contradiction shows that a non-empty set G cannot be way below any other set F. Therefore, the only set G for which G ≪ R is the empty set, so ∨{G ∈ F : G ≪ R} = ∅ ≠ R.
Correct definition. Fortunately, a small modification of the above definition makes F a continuous lattice. Namely, as the desired ordering relation on the class F of all closed sets, we can consider the reverse inclusion relation ⊇.
It is easy to show that (F, ⊇) is a complete lattice: indeed,
∧{F_i : i ∈ I} = the closure of ∪{F_i : i ∈ I};
∨{F_i : i ∈ I} = ∩{F_i : i ∈ I}.
Let us prove that (F, ⊇) is a continuous lattice. By definition, a continuous lattice means that for every closed set F ∈ F, we have F = ∨{G ∈ F : G ≪ F}. Since G ≪ F implies G ≤ F, that is, G ⊇ F, we thus have ∨{G ∈ F : G ≪ F} ⊇ F. So, to prove the desired equality, it is sufficient to prove that F ⊇ ∨{G ∈ F : G ≪ F}.
We have already mentioned that in the lattice (F, ⊇), the join ∨ is simply the intersection of the corresponding sets, so the desired property can be rewritten as ∩{G : G ≪ F} ⊆ F. It turns out that it is easier to prove the equivalent inclusion of complements, that is, to prove that F^c ⊆ ∪{G^c : G ∈ F, G ≪ F} (where F^c denotes the complement of the set F).
There is no easy and intuitive way to immediately prove this result, because the notion of "way below" is complex and is therefore not intuitively clear. So, to be able to prove results about this relation ≪, let us reformulate it in an equivalent, more intuitive way.
Lemma 2. For X = R^d (and, more generally, for an arbitrary locally compact Hausdorff second countable topological space X), for F, G ∈ F(X), F ≪ G if and only if there exists a compact set K ⊆ X such that F^c ⊆ K ⊆ G^c.
Proof
1. Sufficiency. Let F, G ∈ F(X) and let K be a compact set such that F^c ⊆ K ⊆ G^c. Let G ≤ ∨D for some directed family D. We already know that ∨ is simply the intersection, that is, G ⊇ ∩{d : d ∈ D}. In terms of complements, we get an equivalent inclusion G^c ⊆ ∪{d^c : d ∈ D}. Since K ⊆ G^c, we conclude that K ⊆ ∪{d^c : d ∈ D}. Complements d^c of closed sets d ∈ D are open sets. Thus, the compact set K is covered by a family of open sets d^c, d ∈ D. By definition of a compact set, this means that we can select a finite subcover, that is, conclude that K ⊆ F_1^c ∪ ... ∪ F_n^c for some closed sets F_i ∈ D. Since F^c ⊆ K, we thus have F^c ⊆ F_1^c ∪ ... ∪ F_n^c, that is, equivalently, F ⊇ F_1 ∩ ... ∩ F_n, that is, F ≤ F_1 ∨ ... ∨ F_n.
Since the sets F_1, ..., F_n belong to the directed family D, this family must also contain an upper bound d ∈ D of these sets. By definition of the least upper bound, we have F_1 ∨ ... ∨ F_n ≤ d, hence F ≤ d. Thus, indeed, F ≪ G.
2. Necessity. Let F ≪ G. Since the underlying topological space X is locally compact, each point x ∈ G^c has a compact neighborhood Q_x ⊆ G^c such that its interior Q̊_x contains x. We thus have G^c = ∪{Q̊_x : x ∈ G^c} or, equivalently, G = ∩{(Q̊_x)^c : x ∈ G^c} = ∨{(Q̊_x)^c : x ∈ G^c}.
The finite intersections of the closed sets (Q̊_x)^c (that is, the complements of finite unions of the open sets Q̊_x) form a directed family D, for which the supremum ∨D is the same set G. Since G ≤ ∨D and F ≪ G, we conclude (by definition of the way below relation) that F ⊇ ∩{(Q̊_{x_i})^c : i = 1, ..., n} for some x_1, ..., x_n ∈ G^c. Thus, F ⊇ ∩_{i=1}^n (Q̊_{x_i})^c or, equivalently, F^c ⊆ ∪_{i=1}^n Q̊_{x_i}.
Therefore, F^c ⊆ ∪_{i=1}^n Q_{x_i} ⊆ G^c. Since each set Q_{x_i} is compact, their union is also compact, so we have the desired inclusion F^c ⊆ K ⊆ G^c with K = ∪_{i=1}^n Q_{x_i}.
The lemma is proven.
Proof that (F(X), ⊇) is a continuous lattice. Let us now use this Lemma 2 to show that (F(X), ⊇) is a continuous lattice. We have already mentioned that to prove this fact, we must prove that F^c ⊆ ∪{G^c : G ∈ F, G ≪ F} for every closed set F ∈ F(X). Indeed, let F be a closed set. Since X is locally compact, for every point x from the open set F^c, there exists a compact neighborhood K_x ⊆ F^c such that its interior K̊_x contains x. The complement A = (K̊_x)^c to this (open) interior is a closed set A ∈ F(X), for which x ∈ A^c = K̊_x ⊆ K_x ⊆ F^c. So, due to Lemma 2, A ≪ F. In other words, if x ∈ F^c, then there is an A ∈ F such that x ∈ A^c and A ≪ F. So, we conclude that F^c ⊆ ∪{G^c : G ∈ F, G ≪ F}, that is, that (F(X), ⊇) is indeed a continuous lattice.
Lawson topology on the class of all closed sets. Since (F(X), ⊇) is a continuous lattice, we can define the Lawson topology for this lattice.
For this lattice, let us reformulate the general abstract notion of the Lawson topology in more understandable terms. In the following text, we will denote the class of all compact subsets of the space X by K(X) and the class of all open subsets of X by O(X). When the space X is clear from the context, we will simply write K and O.
Theorem 1. For every LCHS X, the Lawson topology on the continuous lattice (F(X), ⊇) is generated by the subbase consisting of subsets of F of the form F^K := {F ∈ F : F ∩ K = ∅} and F_G := {F ∈ F : F ∩ G ≠ ∅}, where K ∈ K and G ∈ O.
Comment. The class {F ∈ F : F ∩ K = ∅} is the class of all sets F which "miss" the set K, and the class {F ∈ F : F ∩ G ≠ ∅} is the class of all sets F which "hit" the set G. Because of this interpretation, the above topology is called the hit-or-miss topology (Matheron, 1975). So, the meaning of Theorem 1 is that the Lawson topology on (F(X), ⊇) coincides with the hit-or-miss topology.
Proof. By definition, the Lawson topology has a subbase consisting of sets of the form {F' ∈ F : F ≪ F'} and {F' ∈ F : F ≰ F'} for all F ∈ F, where ≤ is ⊇. It turns out that to prove the theorem, it is sufficient to reformulate these sets in terms of intersections.
1. Clearly:
{F' ∈ F : F ≰ F'} = {F' ∈ F : F' ∩ F^c ≠ ∅}.
2. For K ∈ K, we have:
{F ∈ F : F ∩ K = ∅} = ∪{{F ∈ F : F' ≪ F} : F' ∈ F, F' ⊆ K^c}.
Indeed, if F ∈ F and F ∩ K = ∅, then K is a subset of the open set F^c. Since the space X is locally compact, there exist an open set B and a compact set K' such that K ⊆ B ⊆ K' ⊆ F^c. The complement G = B^c to the open set B is a closed set for which K ⊆ G^c ⊆ K' ⊆ F^c. By Lemma 2, this means that G ≪ F and, of course, G ⊆ K^c.
Conversely, let F ∈ F and F' ≪ F with F' ⊆ K^c. Then, by the same Lemma 2, there is a compact set K' with (F')^c ⊆ K' ⊆ F^c. On the other hand, F' ⊆ K^c implies that K ⊆ (F')^c, so we have K ⊆ F^c and K ∩ F = ∅. The theorem is proven.
From topology to metric: need and possibility. We have defined a topology on the class of all closed sets. Topology describes the intuitive notion of closeness in qualitative terms. From the viewpoint of applications, it is convenient to use quantitative measures of closeness such as a metric.
Before we start describing a corresponding metric, let us first prove that such a metric is indeed possible, that is, that the corresponding topological space is metrizable. It is known that for every continuous lattice, the corresponding Lawson topology is compact and Hausdorff. Let us prove that for LCHS spaces X, the Lawson topology on (F(X), ⊇) is not only compact and Hausdorff, it is also second countable and therefore metrizable.
Theorem 2. For every LCHS X, the Lawson topology on (F(X), ⊇) is second countable.
Proof. This proof is given in Matheron (1975) for the hit-or-miss topology. We reproduce it here for completeness.
Since X is a locally compact Hausdorff second countable space, there exists a countable basis B for the topology O consisting of relatively compact sets B ∈ B (i.e., sets whose closure B̄ is compact). The fact that B is a basis means that every open set G ∈ O can be represented as a union of open sets from this basis, that is, that it can be represented in the form G = ∪_{i∈I} B_i, where B_i ∈ B and B̄_i ⊆ G.
By definition of the hit-or-miss topology, its subbase consists of the sets F^A = {F ∈ F : F ∩ A = ∅} for compact sets A and the sets F_A = {F ∈ F : F ∩ A ≠ ∅} for open sets A. Thus, the base of this topology consists of the finite intersections of such sets. The intersection of two sets F^A = {F ∈ F : F ∩ A = ∅} and F^{A'} = {F ∈ F : F ∩ A' = ∅} consists of all the sets F which have common points neither with A nor with A'; this is equivalent to saying that F has no common points with the union A ∪ A', that is, F^A ∩ F^{A'} = F^{A ∪ A'}. Thus, the finite intersections which form the base of the hit-or-miss topology (or, equivalently, the Lawson topology) have the form
F^K_{G_1, ..., G_n} := {F ∈ F : F ∩ K = ∅, F ∩ G_i ≠ ∅, i = 1, 2, ..., n}
for all possible K ∈ K and G_i ∈ O.
Let us consider the following subset of this basis:
T = {F^{D̄_1 ∪ ... ∪ D̄_k}_{B_1, ..., B_m} : B_i, D_j ∈ B; m ≥ 0, k ≥ 0}.
Clearly, T is countable. We need to verify that T is a basis for the Lawson topology. For this, we need to prove that for every element F ∈ F and for every open neighborhood of F, there is an open set from T that contains F and is contained in this neighborhood. It is sufficient to prove this property only for neighborhoods from the known basis.
Indeed, let F ∈ F and let the open set F^K_{G_1, ..., G_n} be a neighborhood of F. We need to find a U ∈ T such that F ∈ U ⊆ F^K_{G_1, ..., G_n}. We will prove this by considering two possible cases: F = ∅ and F ≠ ∅.
considering two possible cases: F = and F .
If F =, then we cannot have F G
i
, so n
must be 0, and
K
G G
K
n
F F =
, ,
1

. Since B is a basis, the


compact set K can be covered by open sets from
this basis. Since K is compact, we can extract a
fnite cover from this cover, that is, conclude that
k
D D K
1
for some elements D
j


B. Thus,
k D D K 1 . Clearly, the empty set has no
common point with anyone, so
.
1 k D D
F


F
If a set has no common points with a larger
set, then it will no common points with a subset
either; so we conclude that
k D D k
F F F
1
. So,
we can take
k D D
U

=
1
F as the desired neigh-
borhood.
If F ≠ ∅, then the fact that F belongs to the neighborhood F^K_{G_1, ..., G_n} means that F ∩ G_i ≠ ∅ for every i. This means that for every i = 1, 2, ..., n, we can pick a point x_i ∈ F ∩ G_i. Since B is a basis, for every i there exists an open set B_i ∈ B such that x_i ∈ B_i ⊆ B̄_i ⊆ G_i ∩ K^c.
This means that the compact set K has no common points not only with F, but also with the closed sets B̄_i. Since the closed sets K and F ∪ B̄_1 ∪ ... ∪ B̄_n are disjoint and the space X (being LCHS) is normal, we can find two disjoint open sets A_1 and A_2 which contain these sets: K ⊆ A_1 and F ∪ B̄_1 ∪ ... ∪ B̄_n ⊆ A_2. Since B is a basis, we can represent the open set A_1 as a union of open sets from B: K ⊆ A_1 = ∪_j D_j for some D_j ∈ B for which D̄_j ⊆ A_1. The set K is compact and thus, from this open cover of K, we can extract a finite sub-cover, that is, we have:
K ⊆ D_1 ∪ ... ∪ D_k ⊆ D̄_1 ∪ ... ∪ D̄_k ⊆ A_1 ⊆ A_2^c, and (D̄_1 ∪ ... ∪ D̄_k) ∩ A_2 = ∅.
Thus, F ∈ U := F^{D̄_1 ∪ ... ∪ D̄_k}_{B_1, ..., B_n} ⊆ F^K_{G_1, ..., G_n}. The theorem is proven.
METRICS ON CLOSED SETS
From theoretical possibility of a metric to a need for an explicit metric. In the previous section, we have shown that for every LCHS X (in particular, for X = R^d), when we define the Lawson topology on the set F(X) of all closed subsets of X, then this set F(X) becomes metrizable. This means that, in principle, this topology can be generated by a metric.
However, from the application viewpoint, it is not enough to consider a theoretical possibility of a metric; it is desirable to describe a specific explicit example of a metric.
Known metric on the class of all closed sets: Hausdorff metric. Intuitively, a metric describes the notion of a distance between two sets. Before we start describing an explicit metric which is compatible with the Lawson topology, let us first describe natural ways to describe such a distance.
For points x, y ∈ R^d, the standard distance d(x, y) can be described in terms of neighborhoods: the distance is the smallest value ε > 0 such that x belongs to the ε-neighborhood of y and y is in the ε-neighborhood of x.
It is natural to extend the notion of ε-neighborhood from points to sets. Namely, a set A is the collection of all its points; thus, an ε-neighborhood N_ε(A) of a set A is the collection of the ε-neighborhoods of all its points. In other words, we can define:
N_ε(A) := {x ∈ X : d(x, a) ≤ ε for some a ∈ A}.
Now, we can define the distance d_H(A, B) between the two sets as the smallest value ε > 0 for which A is contained in the ε-neighborhood of B and B is contained in the ε-neighborhood of A:
d_H(A, B) := inf{ε > 0 : A ⊆ N_ε(B) & B ⊆ N_ε(A)}.
This metric is called the Hausdorff metric (after the mathematician who proposed this definition).
Examples. The Hausdorff distance between the one-point set {0.3} and the interval [0,1] is 0.7: indeed, all the elements of the set {0.3} are contained in the interval [0,1], and every element of the interval [0,1] is contained in the 0.7-vicinity of the point 0.3.
This example can be viewed as a degenerate case of computing the Hausdorff distance between two intervals [a̱, ā] and [ḇ, b̄], namely, the case when a̱ = ā. In general, the Hausdorff distance between the two intervals is equal to max(|a̱ − ḇ|, |ā − b̄|).
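The interval formula above is easy to check numerically. The following Python sketch (our own helper names; a finite grid stands in for the interval) computes Hausdorff distances and reproduces the 0.7 of the example.

```python
# Illustrative sketch (our own helper names): Hausdorff distance between
# finite point sets and between closed intervals on the real line.

def hausdorff_finite(A, B):
    """Hausdorff distance between two finite sets of real numbers."""
    d_ab = max(min(abs(a - b) for b in B) for a in A)
    d_ba = max(min(abs(a - b) for a in A) for b in B)
    return max(d_ab, d_ba)

def hausdorff_intervals(a, b):
    """Hausdorff distance between intervals a and b: max(|a1 - b1|, |a2 - b2|)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

if __name__ == "__main__":
    # The degenerate interval {0.3} versus [0, 1]: distance 0.7.
    print(hausdorff_intervals((0.3, 0.3), (0.0, 1.0)))   # 0.7
    # A finite-sample approximation of the same distance.
    grid = [i / 100 for i in range(101)]                  # points of [0, 1]
    print(hausdorff_finite([0.3], grid))                  # 0.7
```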
Limitations of the known metric. The Hausdorff metric works well for bounded sets in R^d or, more generally, for subsets of a compact set. However, if we use the Hausdorff metric to describe distances between arbitrary closed sets A, B ⊆ R^d, we often get meaningless infinities: for example, in the plane, the Hausdorff distance between any two non-parallel lines is infinity.
How to overcome these limitations: the notion of compactification. To overcome the above limitation, it is desirable to modify the definition of the Hausdorff metric.
Since the Hausdorff metric works well for compact spaces, a natural idea is to somehow embed the original topological space into a compact set. Such a procedure is known as compactification.
Simplest compactification. In the simplest compactification, known as the Alexandroff (or one-point) compactification, we simply add a single point ∞ to the original space X.
The corresponding topology on X ∪ {∞} is defined as follows:
• if a set S does not contain the new point ∞, then it is open if and only if it was open in the original topology;
• if a set S contains ∞, then it is called open if and only if its complement S^c is a compact subset of the original space X.
From points to closed sets. This compactification can be extended to an arbitrary closed set F: namely, a set F which was closed in X is not necessarily closed in the new space X ∪ {∞}, so we have to take the closure F̄ of this set.
If we have a metric d' on the compactification, then we can define the distance between closed sets F, F' ∈ F as H_{d'}(F̄, F̄'), the Hausdorff distance associated with d'.
Compactification of R^d. For R^d, the one-point compactification can be implemented in a very natural way, via the so-called stereographic projection. Specifically, we can interpret the one-point compactification of the space R^d as a sphere S^d ⊆ R^{d+1}. The correspondence between the original points of R^d and S^d is arranged as follows. We place the space R^d horizontally, and we place the sphere S^d on top of this plane, so that its South pole is on that plane. Then, to find the image p(x) of a point x ∈ R^d on the sphere, we connect this point x with the North pole of the sphere by a straight line, and take, as p(x), the intersection between this line and the sphere. In this manner, we cover all the points on the sphere except for the North pole itself. The North pole corresponds to infinity, in the sense that if x → ∞, then p(x) tends to the North pole.
In this sense, S^d is a one-point compactification of the original space R^d: it is a compact space, and it is obtained by adding a point (the North pole) to (the image of) R^d.
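The following Python sketch illustrates the projection for d = 1. The specific coordinates (a unit circle resting on the real line, South pole at the origin, North pole at (0, 2)) and the use of the chordal (straight-line) distance are our own choices for the illustration; the chapter does not fix these details.

```python
import math

# Illustrative sketch (our own coordinate choices): stereographic projection of
# the real line onto a unit circle resting on it, South pole at the origin.

def project(x):
    """Stereographic image p(x) on the circle of a point x of the real line."""
    t = 4.0 / (x * x + 4.0)          # parameter of the intersection point
    return (t * x, 2.0 - 2.0 * t)    # lies on the circle u^2 + (v - 1)^2 = 1

NORTH_POLE = (0.0, 2.0)              # image of the added point "infinity"

def chordal_distance(p, q):
    """Straight-line distance between two points of the circle."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

if __name__ == "__main__":
    print(project(0.0))                                   # (0.0, 0.0): the South pole
    print(chordal_distance(project(10.0), NORTH_POLE))    # moderately small
    print(chordal_distance(project(1000.0), NORTH_POLE))  # much smaller: p(x) -> North pole
```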
From points to closed sets. By using this construction, we can naturally find the projection of an arbitrary closed set F ∈ F(R^d) as the collection of the projections of all its points, that is, as p(F) = {p(x) : x ∈ F}. The only minor problem is that even though we started with a closed set F, this collection p(F), by itself, may not be closed. For example, a straight line on a plane is closed, but its projection is not closed, since it has points arbitrarily close to the North pole but not the North pole itself.
To resolve this problem, we can then take the closure of this image. Thus, we arrive at the following stereographic Hausdorff metric H_{d'}(cl(p(F)), cl(p(F'))), where d' is the standard distance on the sphere S^d.
Relation between this metric and the Lawson topology. It turns out that the Lawson topology on F(R^d) is compatible with the Hausdorff metric on the one-point compactification.
For the stereographic Hausdorff metric, this equivalence is proven, for example, in Rockafellar and Wets (1984); for the general LCHS space, this is proven in Wei and Wang (in press).
RANDOM CLOSED SETS: FINAL
DEFINITION AND CHOQUET
THEOREM
Final definition of a random (closed) set. With the material developed in previous sections, we are now ready to formulate rigorously the concept of random closed sets as bona fide random elements. These are generalizations of random vectors and serve as mathematical models for observation processes in which observables are sets rather than points.
Again, let F(R^d) (or F(X)) be the space of closed sets of R^d or, more generally, of a LCHS space X. Let B(F) denote the Borel σ-field of subsets of F, where F is equipped with the Lawson topology. Let (Ω, A, P) be a probability space. A map V : Ω → F which is A-B(F)-measurable is called a random closed set (RCS) on R^d. As usual, the probability law governing V is the probability measure P_V = P∘V^{-1} on B(F).
Application-motivated need to describe a random set. A random set is, in effect, a probability measure on the class of all closed subsets F(X) of the LCHS X. In principle, we can thus describe the random set by listing, for different subsets S ⊆ F, the corresponding probability P(S).
From the application viewpoint, however, this description is duplicating: for example, due to additivity, the probability P_V(S ∪ S') of a union of two disjoint sets is equal to the sum P_V(S) + P_V(S') of the probabilities P_V(S) and P_V(S') and thus, does not have to be described independently.
For random numbers and for random vectors, the solution was to only give the probability of sets from a given subbase. For real numbers, this subbase consisted of the sets (−∞, x], which led to the notion of a probability distribution. For elements of R^d, we considered sets of the type (−∞, x_1] × ... × (−∞, x_d]. For the hit-or-miss topology on the space F(X), the subbase consists of the sets {F : F ∩ A = ∅} for compact A and {F : F ∩ A ≠ ∅} for open sets A. It is therefore reasonable to define the probability only on the sets of this type.
The set {F : F ∩ A = ∅} is the complement of the set {F : F ∩ A ≠ ∅}, so we do not have to describe its probability separately: it is sufficient to describe the probability of the sets {F : F ∩ A ≠ ∅}.
Resulting description: capacity functionals. As a result, we arrive at the following definition. Once a random closed set X is defined, we can define the mapping T : K → [0,1] as T(K) := P{ω : X(ω) ∩ K ≠ ∅} = P_X(F_K). This mapping is called the capacity functional of a random closed set.
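The following Python sketch (entirely our own toy construction) shows how a capacity functional can be estimated empirically: T(K) is approximated by the fraction of simulated random intervals that hit the compact set K.

```python
import random

# Illustrative sketch (our own construction): estimate the capacity functional
# T(K) = P(X hits K) of a random closed interval X from simulated samples.

random.seed(0)

def sample_interval():
    """One observation of a random interval [c - w, c + w] (toy model)."""
    c = random.gauss(0.0, 1.0)
    w = random.uniform(0.1, 0.5)
    return (c - w, c + w)

def hits(interval, K):
    """True if the closed interval hits the compact interval K."""
    return interval[0] <= K[1] and K[0] <= interval[1]

def empirical_T(samples, K):
    """Fraction of sampled sets that hit K: an estimate of T(K)."""
    return sum(hits(s, K) for s in samples) / len(samples)

if __name__ == "__main__":
    data = [sample_interval() for _ in range(10_000)]
    print(empirical_T(data, (0.0, 0.1)))    # estimate of T([0, 0.1])
    print(empirical_T(data, (-1.0, 1.0)))   # larger K gives a larger value (monotonicity)
```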
Is this description sufficient? A natural question is whether from this functional we can uniquely determine the corresponding probability measure. A related question is: what are the conditions under which such a measure is possible?
It can be checked that T satisfies the following properties:
i. 0 ≤ T ≤ 1, T(∅) = 0;
ii. if K_n ↓ K, then T(K_n) ↓ T(K);
iii. T is alternating of infinite order, that is, T is monotone (with respect to ⊆) and, for K_1, K_2, ..., K_n, n ≥ 2,
T(∩_{i=1}^n K_i) ≤ Σ_{∅ ≠ I ⊆ {1, 2, ..., n}} (−1)^{|I|+1} T(∪_{i∈I} K_i),
where |I| denotes the cardinality of the set I.
It turns out that under these conditions, there is indeed such a measure.
Theorem (Choquet). Let T : K → R. Then the following two statements are equivalent to each other:
• there exists a probability measure Q on B(F) for which Q(F_K) = T(K) for all K ∈ K;
• T satisfies the conditions i, ii, and iii.
If one of these statements is satisfied, then the corresponding probability measure Q is uniquely determined by T.
For a proof, see Matheron (1975).
This theorem is the counterpart of the Lebesgue-Stieltjes theorem characterizing probability measures on the Borel σ-field of R^d in terms of multivariate distribution functions of random vectors. For subsequent developments of RCS, see, for example, Nguyen (2006).
FROM RANDOM SETS TO RANDOM
FUZZY SETS
Need for fuzziness. As stated in the Introduction, random set observations are coarse data containing the true, unobservable outcomes of random experiments on phenomena.
A more general type of random and imprecise observations occurs when we have to use natural language to describe our perception. For example, "the risk is high" is an observation containing imprecision at a higher level, due to the fuzziness in our natural language. Modeling fuzziness in natural language, that is, modeling the meaning of terms, is crucial if we wish to process information of this type.
How we can describe fuzziness. Following Zadeh, we use the theory of fuzzy sets to model the meaning of natural language terms; see, for example, Nguyen and Walker (2006) for fuzzy sets and fuzzy logics.
The theory is in fact valid for any LCHS space X. For concreteness, one can keep in mind the example X = R^d. By a fuzzy subset of X we mean a function f : X → [0,1], where f(x) is the degree to which the element x is compatible with the fuzzy concept represented by f. This function f is also called a membership function.
For example, the membership function f of the fuzzy concept A = "small non-negative numbers" of R could be f(x) = 0 if x < 0 and f(x) = e^{−x} for x ≥ 0, where f(x) = e^{−x} is the degree to which x is considered a small non-negative number.
Ordinary subsets of X are special cases of fuzzy subsets of X: for A ⊆ X, its indicator function I_A : X → {0,1} ⊆ [0,1], with I_A(x) = 1 if x ∈ A and I_A(x) = 0 if x ∉ A, is the membership function of A.
An alternative way to describe fuzzy sets: α-cuts. Informally, the fact that the actual (unknown) value x_act satisfies the property described by a fuzzy set f means the following:
• with confidence 1, the actual value x_act satisfies the condition f(x_act) > 0;
• with confidence 0.9, the actual value x_act satisfies the condition f(x_act) ≥ 1 − 0.9 = 0.1, that is, belongs to the set {x : f(x) ≥ 0.1};
• and so on.
In view of this interpretation, instead of describing the fuzzy set by its membership function f(x), we can alternatively describe it by the corresponding sets A_α := {x ∈ X : f(x) ≥ α} for different α ∈ [0,1]. These sets are called alpha-cuts (α-cuts).
Alpha-cuts are nested in the sense that if α < α', then A_{α'} ⊆ A_α. So, a fuzzy set can be viewed as a nested family of sets (corresponding to different degrees of confidence); see, for example, Klir and Yuan (1995), Nguyen and Kreinovich (1996), and Nguyen and Walker (2006).
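As a small illustration, the following Python sketch (our own grid-based helper) computes α-cuts of the membership function f(x) = e^{−x} for x ≥ 0 from the earlier example and checks that they are nested.

```python
import math

# Illustrative sketch (our own grid-based helper): alpha-cuts of the membership
# function of "small non-negative numbers", restricted to a finite grid.

def membership(x):
    return 0.0 if x < 0 else math.exp(-x)

def alpha_cut(alpha, grid):
    """Grid points x with membership(x) >= alpha: a discrete stand-in for A_alpha."""
    return [x for x in grid if membership(x) >= alpha]

if __name__ == "__main__":
    grid = [i / 10 for i in range(-10, 51)]      # from -1.0 to 5.0
    cut_09 = alpha_cut(0.9, grid)                # roughly [0, 0.1]
    cut_01 = alpha_cut(0.1, grid)                # roughly [0, 2.3]
    print(min(cut_09), max(cut_09))
    print(min(cut_01), max(cut_01))
    # Nestedness: a higher alpha gives a smaller cut.
    assert set(cut_09) <= set(cut_01)
```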
Fuzzy analog of closed sets. In our description of random sets, we limited ourselves to closed sets. Since a fuzzy set can be viewed as a nested family of sets (its alpha-cuts), it is reasonable to consider fuzzy sets in which all alpha-cuts are closed sets.
In other words, it is reasonable to restrict ourselves to fuzzy sets f : X → [0,1] which have the following property: for every α ∈ R, the set A_α = {x ∈ X : f(x) ≥ α} is a closed subset of X. In mathematics, functions f with this property are called upper semicontinuous (usc, for short). We will thus call a fuzzy set a fuzzy closed set if its membership function is usc.
Closed sets are a particular case of fuzzy closed sets. We have already mentioned that traditional (crisp) sets are a particular case of fuzzy sets. It is easy to check that closed crisp sets are also closed as fuzzy sets.
Moreover, a subset of X is closed if and only if its indicator function is upper semicontinuous.
Towards a definition of a random fuzzy closed set. To formalize the concept of random fuzzy (closed) sets, we will therefore consider the space USC(X) (where X is a LCHS space) of all usc functions on X with values in [0,1].
As we mentioned earlier, a natural way to define the notion of a random fuzzy closed set is to describe a topology on the set USC(X). Similarly to the case of closed sets, we will use the Lawson topology generated by a natural order on USC(X).
The space USC(X) has a natural pointwise order: f ≤ g if and only if f(x) ≤ g(x) for all x ∈ X. We can also consider the dual order f ≥ g (which is equivalent to g ≤ f). We will now look for the corresponding Lawson topology.
THE CONTINUOUS LATTICE OF
UPPER SEMICONTINUOUS
FUNCTIONS
First try: (USC(X), ≤). Let X be any LCHS space. By USC(X) we mean the set of all usc functions f : X → [0,1].
With the order relation ≤, USC(X) is a complete lattice, where:
(∧_{j∈J} f_j)(x) = inf{f_j(x) : j ∈ J}, and
(∨_{j∈J} f_j)(x) = sup{α ∈ [0,1] : x ∈ A_α},
where A_α = the closure of ∪_{j∈J} {y ∈ X : f_j(y) ≥ α}.
Remarks
(i) Clearly, inf{f_j : j ∈ J, f_j ∈ USC(X)} ∈ USC(X). Indeed, let α ∈ [0,1]; then {x ∈ X : inf_j f_j(x) < α} = ∪_{j∈J} {x : f_j(x) < α} is an open set.
(ii) Let us explain why we cannot simply use the pointwise supremum to define ∨.
Indeed, for f_n(x) = I_{[1/n, +∞)}(x), n ≥ 1, we have f_n ∈ USC(X), but the pointwise supremum sup{f_n(x) : n ≥ 1} = I_{(0, +∞)}(x) ∉ USC(X).
To define ∨{f_j : j ∈ J}, we proceed as follows:
For α ∈ [0,1], let A_α = the closure of ∪_{j∈J} {y ∈ X : f_j(y) ≥ α}, and define f(x) = sup{α ∈ [0,1] : x ∈ A_α}.
We claim that f is the least upper bound of {f_j : j ∈ J, f_j ∈ USC(X)} in USC(X) with respect to ≤.
Clearly, each A_α is closed and α < β implies A_β ⊆ A_α. As such, f ∈ USC(X).
Next, for x ∈ X, we have x ∈ the closure of ∪_{j∈J} {y : f_j(y) ≥ f_i(x)} = A_{f_i(x)} for any i ∈ J. Thus, for every i ∈ J, we have f_i(x) ≤ f(x). Since this is true for every x, we thus conclude that f_i ≤ f for all i ∈ J.
Now, for g ∈ USC(X), we can write g(x) = sup{α ∈ [0,1] : x ∈ A_α(g)}, where A_α(g) = {y ∈ X : g(y) ≥ α}.
For any y ∈ ∪_{j∈J} {x : f_j(x) ≥ α}, that is, for any y for which f_j(y) ≥ α for some j, we have g(y) ≥ α if g ≥ f_j for all j ∈ J. Thus, y ∈ A_α(g), implying that A_α ⊆ A_α(g), since A_α(g) is closed. But then sup{α ∈ [0,1] : x ∈ A_α} ≤ sup{α ∈ [0,1] : x ∈ A_α(g)} and hence f ≤ g.
(iii) However, the complete lattice (USC(X), ≤) is not continuous.
The proof of this statement is similar to the proof that the lattice (F(X), ⊆) is not continuous. Indeed, let f : X → [0,1] be the function for which f(x) = 1 for all x ∈ X. Let us then show that the zero function, that is, the function g for which g(x) = 0 for all x ∈ X, is the only function in USC(X) which is way below f.
It suffices to show that, for any real number r > 0 and any point y ∈ X, the function r·I_{{y}} (e.g., (1/2)·I_{{0}}) is not way below f.
Let f_n(x) = I_{(y−1/n, y+1/n)^c}(x); then ∨_{n≥1} f_n = 1, but for any K we have (∨_{n=1}^K f_n)(y) = 0, thus r·I_{{y}} ≰ ∨_{n=1}^K f_n, implying that r·I_{{y}} is not way below f.
Correct description: (USC(X), ≥). Thus, as in the case of F(X), we should consider the reverse order ≥.
Theorem 3. The complete lattice (USC(X), ≥), where ∨{f_j : j ∈ J} = inf{f_j : j ∈ J} and ∧{f_j : j ∈ J} = h with
h(x) = sup{α ∈ [0,1] : x ∈ A_α} and A_α = the closure of ∪_{j∈J} {y ∈ X : f_j(y) ≥ α},
is continuous.
Before proving this theorem, we need a representation for elements of USC(X). For every real number r and for every compact set K ∈ K(X), let us define an auxiliary function g_{r,K} as follows:
g_{r,K}(x) = r if x ∈ K and g_{r,K}(x) = 1 otherwise.
Lemma 3. Any f ∈ USC(X) can be written as f = inf{g_{r,K} : f(y) < r for all y ∈ K}, where the infimum is taken over all pairs r ∈ [0,1] and K ∈ K(X) for which f(y) < r for all y ∈ K.
Comment. In this definition, the infimum of an empty set is assumed to be equal to 1.
Proof. Let x ∈ X. Let us consider two possible cases: f(x) = 1 and f(x) < 1.
(i) Let us first consider the case when f(x) = 1. To prove the formula for this case, we will consider two subcases: when x ∉ K and when x ∈ K.
In the first subcase, when x ∉ K, we have g_{r,K}(x) = 1 by definition of the function g_{r,K}. Thus, the infimum is equal to 1, that is, to f(x).
In the second subcase, when x ∈ K, there is no r such that f(y) < r for all y ∈ K (since f(x) = 1). Thus, the infimum is also equal to 1 = f(x).
(ii) Let us now consider the case when f(x) < 1. We need to prove that f(x) is the greatest lower bound of the values g_{r,K}(x). Let us first prove that f(x) is a lower bound. For that, we will again consider two subcases: x ∈ K and x ∉ K. When x ∈ K, we have f(x) < r = g_{r,K}(x). When x ∉ K, we have f(x) < 1 = g_{r,K}(x). Thus, in both subcases, f(x) is indeed a lower bound of {g_{r,K}(x)}.
Let us now prove that f(x) is the greatest lower bound. In other words, let us prove that for any ε > 0, the value f(x) + ε is not a lower bound of {g_{r,K}(x)}.
Indeed, let r_0 = f(x) + ε/2; then x belongs to the open set {y ∈ X : f(y) < r_0 = f(x) + ε/2}. By local compactness of X, we conclude that there is a compact set K_0 ⊆ {y : f(y) < r_0} such that x ∈ K̊_0, implying that g_{r_0,K_0}(x) = r_0 < f(x) + ε, that is, for every y ∈ K_0 we have f(y) < r_0, and g_{r_0,K_0}(x) < f(x) + ε. The Lemma is proven.
Now, we are ready to prove the theorem.
Proof of Theorem 3. Consider (USC(X), ≥). For f ∈ USC(X), we always have f ≤ inf{g ∈ USC(X) : g ≪ f} (pointwise), thus it is sufficient to show that
inf{g ∈ USC(X) : g ≪ f} ≤ inf{g_{r,K} : f(y) < r for all y ∈ K} = f.
To prove this relation, it is sufficient to show that g_{r,K} ≪ f for any pair (r, K) such that f(y) < r for all y ∈ K.
Indeed, let F ⊆ USC(X) be a directed set whose supremum dominates f in the lattice order, that is, for which f ≥ inf F (pointwise). To prove the desired way below relation, we need to find h ∈ F such that h ≤ g_{r,K} (pointwise).
For (r, K) with r > f ≥ inf F on K, we will show that there exists an h ∈ F such that h < r on K. For any h ∈ F, denote K_h := {x ∈ K : r ≤ h(x)}. Since h is usc, the set K_h is a closed set.
Let us prove that the intersection ∩_{h∈F} K_h of all these sets K_h is the empty set. Indeed, if x ∈ ∩_{h∈F} K_h, then, by definition of K_h, we have r ≤ h(x) for all h ∈ F. This would imply that r ≤ (inf F)(x), contradicting the fact that r > f(x) ≥ (inf F)(x) for all x ∈ K.
Since every set K_h is closed, its complement K_h^c is open. From ∩_{h∈F} K_h = ∅, we conclude that ∪_{h∈F} K_h^c = X and thus, K ⊆ ∪_{h∈F} K_h^c. Since K is a compact set, from this open cover, we can extract a finite subcover, hence K ⊆ ∪_{i=1}^n K_{h_i}^c for some functions h_i. Thus, we have K ∩ (∩_{i=1}^n K_{h_i}) = ∅. Since K_h ⊆ K for all h, we thus conclude that ∩_{i=1}^n K_{h_i} = ∅.
Since F is directed, it contains an upper bound h of h_1, ..., h_n, that is, an element h ∈ F with h ≤ inf{h_i : i = 1, ..., n} (pointwise). For this h, we have K_h ⊆ ∩_{i=1}^n K_{h_i} = ∅. This means that for every x ∈ K, we have h(x) < r.
Now, for an arbitrary point x ∈ X, we have two possibilities: x ∈ K and x ∉ K. If x ∈ K, then h(x) < r = g_{r,K}(x). On the other hand, if x ∉ K, then h(x) ≤ 1 = g_{r,K}(x). So, for all x ∈ X, we have h(x) ≤ g_{r,K}(x). The statement is proven.
Towards an application-oriented description of the corresponding Lawson topology. Since (USC(X), ≥) is a continuous lattice, we can define the corresponding Lawson topology.
The Lawson topology is defined in very general, very abstract terms. To be able to efficiently apply this abstractly defined topology to random fuzzy closed sets, it is desirable to describe the Lawson topology for such sets in easier-to-understand terms. Such a description is given by the following result.
Theorem 4. The following sets form a subbase for the Lawson topology on (USC(X), ≥): the sets of the form {f : f(y) < r for all y ∈ K} for all r ∈ (0,1] and K ∈ K(X), and the sets {f : g(x) < f(x) for some x ∈ X} for all g ∈ USC(X).
Proof. By definition, the Lawson topology on (USC(X), ≥) is generated by the subbase consisting of the sets {f : g ≪ f} and {f : g ≰ f}. For the ordered set (USC(X), ≥), clearly, {f : g ≰ f} = {f : g(x) < f(x) for some x ∈ X}.
Let us now consider the sets {f : g ≪ f}. It is known that these sets form a base of a topology which is called the Scott topology; see, for example, Gierz et al. (1980). A Scott open set is thus an arbitrary union of sets of the type {f : g ≪ f} with different g. So, to prove our theorem, it is sufficient to prove that the sets {f : f(y) < r for all y ∈ K} form a subbase of the Scott topology.
To prove this, we first prove that each such set is indeed a Scott open set; this is proved in the Lemma below. It then remains to verify that the above sets indeed form a subbase for the Scott topology.
Indeed, let A be a Scott open set which contains an element h. Since A = ∪_g {f : g ≪ f}, we have h ∈ {f : g ≪ f} for some g, that is, g ≪ h. It is easy to show that:
h = ∨{g_{r,K} : h(y) < r for all y ∈ K} = inf{g_{r,K} : h(y) < r for all y ∈ K}.
Thus, by definition of the way below relation, g ≪ h implies that, for some finite collection of pairs (r_i, K_i) with h(y) < r_i for all y ∈ K_i (i = 1, 2, ..., n), we have
g ≥ inf{g_{r_i,K_i} : i = 1, 2, ..., n} (pointwise),
and hence
h ∈ ∩_{i=1}^n {f : f(y) < r_i for all y ∈ K_i}.
Let us show that this intersection is indeed a subset of A. Observe that, for every f ∈ USC(X), we have g_{r_i,K_i} ≪ f if f(y) < r_i for all y ∈ K_i (this is exactly what was shown in the proof of Theorem 3). Now let f̃ be an arbitrary function from the intersection, that is, f̃ ∈ ∩_{i=1}^n {f : f(y) < r_i for all y ∈ K_i}. Then
inf{g_{r_i,K_i} : i = 1, ..., n} ≪ f̃, and hence g ≪ f̃,
so that, by the representation of A, f̃ ∈ {f : g ≪ f} ⊆ A. Thus, ∩_{i=1}^n {f : f(y) < r_i for all y ∈ K_i} ⊆ A.
To complete the proof, it is therefore sufficient to prove the following lemma:
to prove the following lemma:
Lemma 4. For every r (0,1] and K K(X),
we have (see Box 2).
Proof. Let f ∈ USC(X) be such that f(y) < r for all y ∈ K. Since f is usc, the set A = {x ∈ X : f(x) < r} is an open set of X which contains K. Thus, there exist separating sets: an open set U and a compact set V such that K ⊆ U ⊆ V ⊆ A. Since U ⊆ V and U is an open set, we thus conclude that U ⊆ V̊ and hence K ⊆ V̊.
By definition of the function g_{r,V}, from V ⊆ A (so that f(y) < r for all y ∈ V), we conclude that g_{r,V} ≪ f. So
{f ∈ USC(X) : f(y) < r for all y ∈ K} ⊆ ∪{{f ∈ USC(X) : g_{r,K'} ≪ f} : K' ∈ K(X), K ⊆ K̊'}.
Conversely, if g_{r,K'} ≪ f for some K' ∈ K(X) with K ⊆ K̊', then for every y ∈ K, there exist r_y and K_y such that y ∈ K_y and f(z) < r_y ≤ g_{r,K'}(z) for all z ∈ K_y. In particular, for z = y, we get f(y) < r_y ≤ g_{r,K'}(y) = r, so that f(y) < r for all y ∈ K. The lemma is proven, and so is the theorem.
METRICS AND CHOQUET
THEOREM FOR RANDOM FUZZY
SETS
Resulting formal definition of a random fuzzy set. As we have mentioned, in general, a random object on a probability space (Ω, A, P) is defined as a mapping x : Ω → O which is A-B-measurable with respect to an appropriate σ-field B of subsets of the set O of objects.
For closed fuzzy sets, the set O is the set USC(X) of all upper semicontinuous functions, and the appropriate σ-field is the σ-field L(X) of all Borel sets in the Lawson topology.
Thus, we can define a random fuzzy (closed) set S on a probability space (Ω, A, P) as a map S : Ω → USC(X) which is A-L(X)-measurable.
Properties of the corresponding Lawson
topology. From the general theory of continuous
lattices, we can conclude that the space USC(X)
is a compact and Hausdorff topological space.
When X is a LCHS space, then, similarly to the
case of the set of all (crisp) closed sets F(X), we
can prove that the set of all fuzzy closed sets
USC(X) is also second countable (see the proof
below) and thus, metrizable.
Later in this section, we will discuss different
metrics on USC(X) which are compatible with
this topology, and the corresponding Choquet
theorem.
Theorem 5. The topological space USC(X)
with the Lawson topology is second countable.
Proof. It is known that for every continuous lattice, the Lawson topology is second countable if and only if the Scott topology has a countable base (Gierz et al., 1980). Thus, to prove our result, it is sufficient to prove that the Scott topology has a countable base.
In view of the results described in the previous section, the Scott topology has a base consisting of sets of the form
∩_{i=1}^n {f ∈ USC(X) : f(y) < r_i for all y ∈ K_i},
where r_i ∈ (0,1], K_i ∈ K(X), and n ≥ 0. Let us denote
U(r_i, K_i) := {f : f(y) < r_i for all y ∈ K_i}.
In these terms, the base of the Scott topology consists of the finite intersections ∩_{i=1}^n U(r_i, K_i).
Recall that since X is LCHS, X is normal, and there is a countable base B of the topological space (X, O) such that for every B ∈ B, the closure B̄ is compact, and for every open set G ∈ O, we have G = ∪_{j∈J} B_j for some B_j ∈ B with B̄_j ⊆ G.
Our claim is that a countable base of the Scott topology on USC(X) consists of the sets of the form
∩_{i=1}^n U(q_i, ∪_{j=1}^{m_i} B̄_{ij}),
where the q_i are rational numbers from the interval (0,1], and B_{ij} ∈ B.
It suffices to show that every neighborhood ∩_{i=1}^n U(r_i, K_i) of a closed fuzzy set f ∈ USC(X) contains a sub-neighborhood ∩_{i=1}^n U(q_i, ∪_{j=1}^{m_i} B̄_{ij}) which still contains f.
Indeed, by definition of the sets U(r_i, K_i), the fact that f ∈ ∩_{i=1}^n U(r_i, K_i) means that for every i and for every y ∈ K_i, we have f(y) < r_i. Let us denote by A_i the set of all the values x ∈ X for which f(x) ≥ r_i: A_i = {x ∈ X : f(x) ≥ r_i}. Since the function f is upper semicontinuous, the set A_i is closed. The sets A_i and K_i are both closed and clearly A_i ∩ K_i = ∅. Since the space X is normal, there exists an open set G_i that separates A_i and K_i, that is, for which K_i ⊆ G_i and A_i ∩ G_i = ∅. Due to the property of the base, we have G_i = ∪_{j∈J} B_{ij} with B̄_{ij} ⊆ G_i. Thus, K_i ⊆ ∪_{j∈J} B_{ij}. Since the set K_i is compact, from this open cover, we can extract a finite sub-cover, that is, conclude that
K_i ⊆ ∪_{j=1}^{m_i} B_{ij} ⊆ ∪_{j=1}^{m_i} B̄_{ij} ⊆ G_i.
From A_i ∩ G_i = ∅, we can now conclude that
A_i ∩ (∪_{j=1}^{m_i} B̄_{ij}) = ∅,
implying that for every y ∈ ∪_{j=1}^{m_i} B̄_{ij}, we have y ∉ A_i, that is, f(y) < r_i. Thus, f(y) < r_i for any y ∈ ∪_{j=1}^{m_i} B̄_{ij}.
Since f is usc, it attains its maximum on the compact set ∪_{j=1}^{m_i} B̄_{ij} at some point y_max. For this maximum value, we therefore also have f(y_max) < r_i and therefore, there exists a rational number q_i such that f(y_max) < q_i < r_i. Since the value f(y_max) is the maximum, we conclude that f(y) ≤ f(y_max) for all other y ∈ ∪_{j=1}^{m_i} B̄_{ij}. Thus, for all such y, we have f(y) < q_i. This means that f ∈ ∩_{i=1}^n U(q_i, ∪_{j=1}^{m_i} B̄_{ij}).
It is easy to show that:
∩_{i=1}^n U(q_i, ∪_{j=1}^{m_i} B̄_{ij}) ⊆ ∩_{i=1}^n U(r_i, K_i).
The theorem is proven.
Towards defining metrics on USC(X). We have proven that the Lawson topology on USC(X) is metrizable, that is, that there exists a metric which is compatible with this topology. From the viewpoint of possible applications, it is desirable to give an explicit description of such a metric.
In Section 4, we used the one-point compactification procedure to explicitly describe a specific metric compatible with the Lawson topology on the set F(X) of all closed (crisp) sets. Let us show that for the class USC(X) of closed fuzzy sets, we can define a similar metric if we identify these closed fuzzy sets with their hypographs, that is, informally, the areas below their graphs.
Formally, for every function f : X → [0,1], its hypograph Hyp(f) is defined as:
Hyp(f) = {(x, α) ∈ X × [0,1] : f(x) ≥ α}.
Every hypograph is a subset of X × [0,1]. Since we consider usc functions, each hypograph is closed.
Let HYP(X) denote the set of all hypographs of all functions f ∈ USC(X). Then f → Hyp(f) is a bijection from USC(X) to HYP(X); see Nguyen, Wang, and Wei (in press). One can easily check that the set HYP(X) is a closed subset of F(X × [0,1]).
Note that since X is a LCHS space, the set X × [0,1] is also a LCHS space, so the set F(X × [0,1]) of all its closed subsets is a topological space with a canonical Lawson topology. Thus, according to Section 4, F(X × [0,1]) has a compatible metric H. Then the induced metric on HYP(X) is also compatible with the induced Lawson topology (or hit-or-miss topology) on HYP(X). The only thing that remains to be proven is that (HYP(X), H) is homeomorphic to (USC(X), L), where L denotes the Lawson topology on USC(X). This result was proven in Nguyen and Tran (2007).
Finally, a Choquet theorem for USC(X) can be obtained by embedding USC(X) into F(X × [0,1]) via hypographs and using the Choquet theorem for random closed sets on X × [0,1]. For more details, see Nguyen et al. (in press).
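As a rough numerical illustration of the hypograph idea, and not of the exact metric construction in the cited papers, the following Python sketch (ours) represents two membership functions by finite samples of their hypographs in X × [0,1] and compares them by the Hausdorff distance between these samples; the compactification step is deliberately ignored here.

```python
import math

# Illustrative sketch (ours): compare two membership functions by the Hausdorff
# distance between finite samples of their hypographs in X x [0, 1].

def hypograph_sample(f, xs, levels):
    """Grid points (x, a) with a <= f(x)."""
    return [(x, a) for x in xs for a in levels if a <= f(x)]

def hausdorff(P, Q):
    def d(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return max(max(min(d(p, q) for q in Q) for p in P),
               max(min(d(p, q) for p in P) for q in Q))

if __name__ == "__main__":
    xs = [i / 10 for i in range(-20, 21)]          # grid on [-2, 2]
    levels = [i / 10 for i in range(11)]           # grid on [0, 1]
    f = lambda x: max(0.0, 1.0 - abs(x))           # triangular fuzzy number at 0
    g = lambda x: max(0.0, 1.0 - abs(x - 0.5))     # the same shape shifted by 0.5
    print(hausdorff(hypograph_sample(f, xs, levels),
                    hypograph_sample(g, xs, levels)))  # grid-approximated distance
```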
TOWARDS PRACTICAL
APPLICATIONS
From the general theory to computationally feasible practical applications. In the previous sections, we described a general theory of random fuzzy sets, a theory motivated by (and tailored towards) potential applications. Our main motivation is to prepare the background for as wide a range of applications as possible. Because of this objective, we tried our best to make our theory as general as possible.
Because of this same application objective, we also tried our best to make the resulting techniques as computation-friendly as possible. However, as we will see in this section, the resulting problems are computationally difficult even in the simplest cases. Because of this difficulty, in this chapter, we will mainly concentrate on such simple cases.
Practical need for random sets and random fuzzy sets: reminder. We started this paper by explaining the practical motivation for random sets and random fuzzy sets.
This motivation is that, due to measurement uncertainty (or uncertainty of expert estimates), often, instead of the actual values x_i of the quantities of interest, we only know the intervals x_i = [x̃_i − Δ_i, x̃_i + Δ_i], where x̃_i is the known approximate value and Δ_i is the upper bound on the approximation error (provided, for measurements, by the manufacturer of the measuring instrument).
These intervals can be viewed as random intervals, that is, as samples from an interval-valued random variable. In such situations, instead of the exact value of a sample statistic such as the covariance C[x,y], we can only have an interval C[x,y] of possible values of this statistic.
The need for such random intervals is well recognized, and there has already been a lot of related research; see, for example, Moeller and Beer (2004). In this approach, the uncertainty in a vector quantity x = (x_1, ..., x_d) ∈ R^d is usually described by describing intervals of possible values of each of its components. This is equivalent to describing the set of all possible values of x as a box (multi-dimensional interval) [x̱_1, x̄_1] × ... × [x̱_d, x̄_d]. However, the resulting data processing problems are often very challenging, and there is still a large room for further development.
One such need comes from the fact that uncertainty is often much more complex than intervals. For example, in the case of several variables, instead of a multi-dimensional interval, we may have a more complex set S ⊆ R^d. In such a situation, we need a more general theory of random sets.
We have also mentioned that, to get a more adequate description of expert estimates, we need to supplement the set S of possible values of the quantity (or quantities) of interest with the sets S_α which contain the values that are possible with a certain degree α. In such situations, we need to consider random fuzzy sets.
What is needed for practical applications: an
outline of this section. As we have just recalled,
there is a practical need to consider random sets
and random fuzzy sets. In order to apply the
corresponding theory, we first need to estimate
the actual distribution of random sets or random
fuzzy sets from the observations.
In other words, we need to develop statistical
techniques for random sets and random fuzzy
sets. In this section, we start with a reminder
about traditional number-valued and vector-valued
statistical techniques, and the need for extending
these techniques to random sets and random fuzzy
sets. Then, we overview the main sources of the
corresponding data uncertainty and techniques
for dealing with the corresponding uncertainty.
This will prepare us for the case study described
in the following section.
Traditional statistics: brief reminder. In traditional statistics, we assume that the observed values are independent identically distributed (i.i.d.) variables x_1, ..., x_n, and we want to find statistics C_n(x_1, ..., x_n) that would approximate the desired parameter C of the corresponding probability distribution.
For example, if we want to estimate the mean E, we can take the arithmetic average E_n[x] = (x_1 + ... + x_n)/n. It is known that as n → ∞, this statistic tends (with probability 1) to the desired mean: E_n[x] → E. Similarly, the sample variance V_n[x] = (1/(n−1))·Σ_{i=1}^n (x_i − E_n[x])² tends to the actual variance V, the sample covariance C_n[x,y] = (1/(n−1))·Σ_{i=1}^n (x_i − E_n[x])·(y_i − E_n[y]) between two different samples tends to the actual covariance C, and so on.
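For reference, the following Python sketch (ours) implements the classical sample statistics recalled above.

```python
# Illustrative sketch (ours): the classical sample statistics E_n, V_n, C_n.

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_covariance(xs, ys):
    mx, my = sample_mean(xs), sample_mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 3.9, 6.2, 8.1]
    print(sample_mean(xs), sample_variance(xs), sample_covariance(xs, ys))
```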
Coarsening: a source of random sets. In traditional statistics, we implicitly assume that the values x_i are directly observable. In real life, due to (inevitable) measurement uncertainty, often, what we actually observe is a set S_i that contains the actual (unknown) value of x_i. This phenomenon is called coarsening; see, for example, Heitjan and Rubin (1991). Due to coarsening, instead of the actual values x_i, all we know is the sets X_1, ..., X_n that are known to contain the actual (un-observable) values x_i: x_i ∈ X_i.
Statistics based on coarsening. The sets X_1, ..., X_n are i.i.d. random sets. We want to find statistics of these random sets that would enable us to approximate the desired parameters of the original distribution of x. Here, a statistic S_n(X_1, ..., X_n) transforms the n sets X_1, ..., X_n into a new set. We want this statistic S_n(X_1, ..., X_n) to tend to a limit set L as n → ∞, and we want this limit set L to contain the value of the desired parameter of the original distribution.
For example, if we are interested in the mean E[x], then we can take S_n = (X_1 + ... + X_n)/n, where the sum is the Minkowski (element-wise) sum of the sets. It is possible to show that, under reasonable assumptions, this statistic tends to a limit L, and that E[x] ∈ L. This limit can be viewed, therefore, as a set-based average of the sets X_1, ..., X_n.
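For interval-valued observations, the Minkowski average reduces to averaging the endpoints. The following Python sketch (ours, with made-up data) illustrates this set-valued average.

```python
# Illustrative sketch (ours): the Minkowski (element-wise) average of interval
# observations X_1, ..., X_n; for intervals it reduces to averaging endpoints.

def minkowski_average(intervals):
    lo = sum(a for a, _ in intervals) / len(intervals)
    hi = sum(b for _, b in intervals) / len(intervals)
    return (lo, hi)

if __name__ == "__main__":
    observations = [(0.9, 1.3), (1.8, 2.4), (2.9, 3.1)]   # coarsened data: x_i in X_i
    print(minkowski_average(observations))
    # Any mean E[x] consistent with the data lies in this set-valued average.
```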
Important issue: computational complexity. There has been a lot of interesting theoretical research on set-valued random variables and the corresponding statistics. In many cases, the corresponding statistics have been designed, and their asymptotic properties have been proven; see, for example, Goutsias, Mahler, and Nguyen (1997); Li, Ogura, and Kreinovich (2002); Nguyen (2006); and references therein.
In many such situations, the main obstacle to a practical use of these statistics is that going from random numbers to random sets drastically increases the computational complexity, and hence the running time, of the required computations. It is therefore desirable to come up with new, faster algorithms for computing such set-valued statistics.
Sources of uncertainty: general reminder. Traditional engineering statistical formulas assume that we know the exact values x_i of the corresponding quantity. In practice, these values come either from measurements or from expert estimates. In both cases, we get only approximations x̃_i to the actual (unknown) values x_i.
When we use these approximate values x̃_i ≈ x_i to compute the desired statistical characteristics such as E and V, we only get approximate values Ẽ and Ṽ for these characteristics. It is desirable to estimate the accuracy of these approximations.
Case of measurement uncertainty. Measurements are never 100 percent accurate. As a result, the result x̃ of a measurement is, in general, different from the (unknown) actual value x of the desired quantity. The difference Δx := x̃ − x between the measured and the actual values is usually called a measurement error.
The manufacturers of a measuring device usually provide us with an upper bound Δ for the (absolute value of) possible errors, that is, with a bound Δ for which we guarantee that |Δx| ≤ Δ. The need for such a bound comes from the very nature of a measurement process: if no such bound is provided, this means that the difference between the (unknown) actual value x and the observed value x̃ can be as large as possible.
Since the (absolute value of the) measurement error Δx = x̃ − x is bounded by the given bound Δ, we can therefore guarantee that the actual (unknown) value of the desired quantity belongs to the interval [x̃ − Δ, x̃ + Δ].
Traditional probabilistic approach to describing measurement uncertainty. In many practical situations, we not only know the interval [−Δ, Δ] of possible values of the measurement error; we also know the probabilities of different values Δx within this interval; see, for example, Rabinovich (2005).
In practice, we can determine the desired probabilities of different values of Δx by comparing the results of measuring with this instrument with the results of measuring the same quantity by a standard (much more accurate) measuring instrument. Since the standard measuring instrument is much more accurate than the one in use, the difference between these two measurement results is practically equal to the measurement error; thus, the empirical distribution of this difference is close to the desired probability distribution for the measurement error.
Interval approach to measurement uncertainty. As we have mentioned, in many practical situations, we do know the probabilities of different values of the measurement error. There are two cases, however, when this determination is not done:
• First is the case of cutting-edge measurements, for example, measurements in fundamental science. When a Hubble telescope detects the light from a distant galaxy, there is no "standard" (much more accurate) telescope floating nearby that we can use to calibrate the Hubble: the Hubble telescope is the best we have.
• The second case is the case of measurements on the shop floor. In this case, in principle, every sensor can be thoroughly calibrated, but sensor calibration is so costly (usually costing 10 times more than the sensor itself) that manufacturers rarely do it.
In both cases, we have no information about the probabilities of Δx; the only information we have is the upper bound Δ on the measurement error.
In this case, after performing a measurement and getting a measurement result x̃, the only information that we have about the actual value x of the measured quantity is that it belongs to the interval x = [x̃ − Δ, x̃ + Δ]. In this situation, for each i, we know the interval x_i of possible values of x_i, and we need to find the ranges E and V of the characteristics E and V over all possible tuples x_i ∈ x_i.
Case of expert uncertainty. An expert usually describes his/her uncertainty by using words from natural language, like "most probably, the value of the quantity is between 6 and 7, but it is somewhat possible to have values between 5 and 8." To formalize this knowledge, it is natural to use fuzzy set theory, a formalism specifically designed for describing this type of informal ("fuzzy") knowledge (Klir & Yuan, 1995; Nguyen & Walker, 2006).
As a result, for every value x_i, we have a fuzzy set μ_i(x_i) which describes the expert's prior knowledge about x_i: the number μ_i(x_i) describes the expert's degree of certainty that x_i is a possible value of the i-th quantity.
As we have mentioned earlier, an alternative user-friendly way to represent a fuzzy set is by using its α-cuts {x_i : μ_i(x_i) ≥ α}. In these terms, a fuzzy set can be viewed as a nested family of intervals [x̱_i(α), x̄_i(α)] corresponding to different levels α.
Estimating statistics under fuzzy uncertainty: precise formulation of the problem. In general, we have fuzzy knowledge μ_i(x_i) about each value x_i; we want to find the fuzzy set corresponding to a given characteristic y = C(x_1, ..., x_n). Intuitively, the value y is a reasonable value of the characteristic if y = C(x_1, ..., x_n) for some reasonable values x_i, that is, if for some values x_1, ..., x_n, x_1 is reasonable, and x_2 is reasonable, ..., and y = C(x_1, ..., x_n). If we interpret "and" as min and "for some" ("or") as max, then we conclude that the corresponding degree of certainty μ(y) in y is equal to
μ(y) = max{min(μ_1(x_1), ..., μ_n(x_n)) : C(x_1, ..., x_n) = y}.
Reduction to the case of interval uncertainty. It is known that the above formula (called the extension principle) can be reformulated as follows: for each α, the α-cut y(α) of y is equal to the range of possible values of C(x_1, ..., x_n) when x_i ∈ x_i(α) for all i. Thus, from the computational viewpoint, the problem of computing a statistical characteristic under fuzzy uncertainty can be reduced to the problem of computing this characteristic under interval uncertainty; see, for example, Dubois, Fargier, and Fortin (2005).
In view of this reduction, in the following text, we will consider the case of interval (and set) uncertainty.
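The following Python sketch (ours) illustrates this α-cut reduction for a characteristic that is monotone in each variable, the arithmetic mean, with inputs given as triangular fuzzy numbers; the triangular shape and all names are our own choices for the illustration.

```python
# Illustrative sketch (ours): fuzzy value of the mean via alpha-cuts.
# For a characteristic that is increasing in each variable (here, the mean),
# the alpha-cut of the result is obtained by applying the characteristic to the
# endpoints of the alpha-cuts of the inputs.

def triangular_cut(a, b, c, alpha):
    """Alpha-cut [lo, hi] of a triangular fuzzy number (a, b, c)."""
    return (a + alpha * (b - a), c - alpha * (c - b))

def mean_cut(fuzzy_numbers, alpha):
    cuts = [triangular_cut(a, b, c, alpha) for (a, b, c) in fuzzy_numbers]
    lo = sum(l for l, _ in cuts) / len(cuts)
    hi = sum(h for _, h in cuts) / len(cuts)
    return (lo, hi)

if __name__ == "__main__":
    data = [(5, 6, 7), (6, 7, 8), (4, 6, 9)]   # expert estimates as triangular numbers
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, mean_cut(data, alpha))    # nested intervals, shrinking as alpha grows
```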
Estimating statistics under interval uncertainty: a problem. In the case of interval uncertainty, instead of the true values x_1, ..., x_n, we only know the intervals x_1 = [x̱_1, x̄_1], ..., x_n = [x̱_n, x̄_n] that contain the (unknown) true values of the measured quantities. For different values x_i ∈ x_i, we get, in general, different values of the corresponding statistical characteristic C(x_1, ..., x_n). Since all values x_i ∈ x_i are possible, we conclude that all the values C(x_1, ..., x_n) corresponding to x_i ∈ x_i are possible estimates for the corresponding statistical characteristic. Therefore, for the interval data x_1, ..., x_n, a reasonable estimate for the corresponding statistical characteristic is the range
C(x_1, ..., x_n) := {C(x_1, ..., x_n) : x_1 ∈ x_1, ..., x_n ∈ x_n}.
We must therefore modify the existing statistical algorithms so that they compute, or bound, these ranges.
40
Random Fuzzy Sets
Estimating mean under interval uncertainty. The arithmetic average $E$ is a monotonically increasing function of each of its $n$ variables $x_1, \ldots, x_n$, so its smallest possible value $\underline{E}$ is attained when each value $x_i$ is the smallest possible ($x_i = \underline{x}_i$), and its largest possible value is attained when $x_i = \overline{x}_i$ for all $i$. In other words, the range $\mathbf{E}$ of $E$ is equal to $[E(\underline{x}_1, \ldots, \underline{x}_n),\; E(\overline{x}_1, \ldots, \overline{x}_n)]$, that is,

$$\underline{E} = \frac{1}{n}(\underline{x}_1 + \cdots + \underline{x}_n) \quad\text{and}\quad \overline{E} = \frac{1}{n}(\overline{x}_1 + \cdots + \overline{x}_n).$$
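Because the average is monotonic in each argument, this range can be computed directly from the endpoints. The following short sketch is ours (an illustration only; the function and variable names are not from the chapter):

# Sketch: exact range of the arithmetic mean over interval data.
# Each interval is assumed to be given as a (lower, upper) pair.
def mean_range(intervals):
    """Return (E_lower, E_upper) for the mean over interval inputs."""
    n = len(intervals)
    e_lower = sum(lo for lo, hi in intervals) / n
    e_upper = sum(hi for lo, hi in intervals) / n
    return e_lower, e_upper

# Example: three measured values, each known only up to +/- 0.1
print(mean_range([(0.9, 1.1), (1.9, 2.1), (2.9, 3.1)]))  # (1.9, 2.1)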
Estimating variance under interval uncertainty. It is known that the problem of computing the exact range $\mathbf{V} = [\underline{V}, \overline{V}]$ for the variance $V$ over interval data $\mathbf{x}_i = [\tilde{x}_i - \Delta_i, \tilde{x}_i + \Delta_i]$ is, in general, NP-hard; see, for example, Kreinovich et al. (2006) and Kreinovich et al. (2007). Specifically, there is a polynomial-time algorithm for computing $\underline{V}$, but computing $\overline{V}$ is, in general, NP-hard.
Comment. NP-hard means, crudely speaking, that no feasible algorithm can compute the exact value of $\overline{V}$ for all possible intervals $\mathbf{x}_1, \ldots, \mathbf{x}_n$; see, for example, Kreinovich et al. (1997).
In many practical situations, there are efficient algorithms for computing $\overline{V}$: for example, an $O(n\log(n))$-time algorithm exists when no two "narrowed" intervals $[x_i^-, x_i^+]$, where

$$x_i^- \stackrel{\text{def}}{=} \tilde{x}_i - \frac{\Delta_i}{n} \quad\text{and}\quad x_i^+ \stackrel{\text{def}}{=} \tilde{x}_i + \frac{\Delta_i}{n},$$

are proper subsets of one another, that is, when $[x_i^-, x_i^+] \not\subset (x_j^-, x_j^+)$ for all $i$ and $j$ (Dantsin et al., 2006).
What can be done if we cannot effectively compute the exact range. As we have just mentioned, the problem of computing statistical characteristics under interval uncertainty is often NP-hard, which means, crudely speaking, that we cannot efficiently compute the exact range for these characteristics.
A natural solution is as follows: since we cannot compute the exact range, we should try to find an enclosure for this range. Computing the range $C(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ of a function $C(x_1, \ldots, x_n)$ based on the input intervals $\mathbf{x}_i$ is called interval computations; see, for example, Jaulin et al. (2001).
Interval computations techniques: brief reminder. Historically, the first method for computing the enclosure for the range is the method which is sometimes called "straightforward" interval computations. This method is based on the fact that, inside the computer, every algorithm consists of elementary operations (arithmetic operations, min, max, etc.). For each elementary operation $f(a, b)$, if we know the intervals $\mathbf{a}$ and $\mathbf{b}$ for $a$ and $b$, we can compute the exact range $f(\mathbf{a}, \mathbf{b})$. The corresponding formulas form the so-called interval arithmetic; see Box 4 for examples.
In straightforward interval computations, we repeat the computations forming the program $C$ for computing $C(x_1, \ldots, x_n)$ step-by-step, replacing each operation with real numbers by the corresponding operation of interval arithmetic. It is known that, as a result, we get an enclosure $\mathbf{Y} \supseteq C(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ for the desired range.
In some cases, this enclosure is exact. In more complex cases (see examples below), the enclosure has excess width.
There exist more sophisticated techniques for producing a narrower enclosure, for example, a centered form method. However, for each of these techniques, there are cases when we get an excess width. (Reason: as we have mentioned, the problem of computing the exact range is known to be NP-hard even for population variance.)

Box 4.
$$[\underline{a}, \overline{a}] + [\underline{b}, \overline{b}] = [\underline{a} + \underline{b},\ \overline{a} + \overline{b}]; \qquad [\underline{a}, \overline{a}] - [\underline{b}, \overline{b}] = [\underline{a} - \overline{b},\ \overline{a} - \underline{b}];$$
$$[\underline{a}, \overline{a}] \cdot [\underline{b}, \overline{b}] = \bigl[\min(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b}),\ \max(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b})\bigr].$$
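As a rough illustration of straightforward interval computations (a sketch of ours, not code from the chapter), one can implement the Box 4 operations and then run an ordinary variance program on intervals; the result encloses the true range but, as noted above, may have excess width because each x_i enters the formula more than once.

# Sketch of "straightforward" interval computations using the Box 4 operations.
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

def variance_enclosure(xs):
    """Run the usual population-variance formula step by step on intervals.

    The returned interval encloses the exact range of the variance,
    but is generally wider (excess width)."""
    n = len(xs)
    scale = Interval(1.0 / n, 1.0 / n)
    mean = Interval(0.0, 0.0)
    for x in xs:
        mean = mean + x
    mean = mean * scale
    var = Interval(0.0, 0.0)
    for x in xs:
        d = x - mean
        var = var + d * d
    return var * scale

data = [Interval(0.9, 1.1), Interval(1.9, 2.1), Interval(2.9, 3.1)]
print(variance_enclosure(data))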
CASE STUDY: A BIOINFORMATICS PROBLEM
In this section, we describe an example of a practical application. This example was first outlined in Kreinovich et al. (2007) and Xiang (2007). For other applications, see Kreinovich et al. (2006), Kreinovich et al. (2007), and references therein.
Description of the case study. In cancer research, it is important to find out the genetic difference between the cancer cells and the healthy cells. In the ideal world, we should be able to have a sample of cancer cells, and a sample of healthy cells, and thus directly measure the concentrations $c$ and $h$ of a given gene in cancer and in healthy cells. In reality, it is very difficult to separate the cells, so we have to deal with samples that contain both cancer and normal cells. Let $y_i$ denote the result of measuring the concentration of the gene in the $i$-th sample, and let $x_i$ denote the percentage of cancer cells in the $i$-th sample. Then, we should have $x_i\cdot c + (1 - x_i)\cdot h \approx y_i$ (approximately equal because there are measurement errors in measuring $y_i$).
Let us first consider an idealized case in which we know the exact percentages $x_i$. In this case, we can find the desired values $c$ and $h$ by solving a system of linear equations $x_i\cdot c + (1 - x_i)\cdot h \approx y_i$ with two unknowns $c$ and $h$.
It is worth mentioning that this system can be somewhat simplified if instead of $c$, we consider a new variable $a \stackrel{\text{def}}{=} c - h$. In terms of the new unknowns $a$ and $h$, the system takes the following form: $a\cdot x_i + h \approx y_i$.
The errors of measuring $y_i$ are normally distributed i.i.d. random variables, so to estimate $a$ and $h$, we can use the Least Squares Method (LSM):

$$\sum_{i=1}^{n}\bigl(a\cdot x_i + h - y_i\bigr)^2 \to \min_{a,\,h}.$$

According to LSM, we have

$$a = \frac{C[x, y]}{V[x]} \quad\text{and}\quad h = E[y] - a\cdot E[x],$$

where $E[x] = \frac{1}{n}(x_1 + \cdots + x_n)$, $E[y] = \frac{1}{n}(y_1 + \cdots + y_n)$, and $V[x] = \frac{1}{n}\sum_{i=1}^{n}(x_i - E[x])^2$. Once we know $a = c - h$ and $h$, we can then estimate $c$ as $a + h$.
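For concreteness, here is a brief sketch of these least-squares formulas (ours, for illustration; C[x, y] is taken here to be the sample covariance, and the toy data are made up):

# Sketch: least-squares estimates a = C[x,y]/V[x], h = E[y] - a*E[x], c = a + h.
def lsm_estimates(x, y):
    n = len(x)
    ex = sum(x) / n
    ey = sum(y) / n
    # C[x, y]: sample covariance; V[x]: population variance of the x_i
    cxy = sum((xi - ex) * (yi - ey) for xi, yi in zip(x, y)) / n
    vx = sum((xi - ex) ** 2 for xi in x) / n
    a = cxy / vx
    h = ey - a * ex
    return a, h, a + h

# Toy data with true c = 5, h = 2, exact percentages x_i and no noise:
x = [0.2, 0.4, 0.6, 0.8]
y = [xi * 5 + (1 - xi) * 2 for xi in x]
print(lsm_estimates(x, y))   # approximately (3.0, 2.0, 5.0)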
The problem is that the concentrations $x_i$ come from experts who manually count different cells, and experts can only provide interval bounds on the values $x_i$, such as $x_i \in [0.7, 0.8]$ (or even only fuzzy bounds). Different values of $x_i$ in the corresponding intervals lead to different values of $a$ and $h$. It is therefore desirable to find the range of $a$ and $h$ corresponding to all possible values $x_i \in [\underline{x}_i, \overline{x}_i]$.
Comment. Our motivation for solving this
problem comes from bioinformatics, but similar
problems appear in various practical situations
where measurements with uncertainties are avail-
able and statistical data is to be processed.
Linear approximation. Let $\tilde{x}_i = (\underline{x}_i + \overline{x}_i)/2$ be the midpoint of the $i$-th interval, and let $\Delta_i = (\overline{x}_i - \underline{x}_i)/2$ be its half-width. For $a$, the linearised half-width is

$$\Delta_a = \frac{1}{n\cdot V[x]}\sum_{i=1}^{n}\Delta_i\cdot\bigl|\,y_i - E[y] - 2a\,\tilde{x}_i + 2a\,E[x]\,\bigr|.$$

We can use the formula $E[y] = a\cdot E[x] + h$ to simplify this expression, resulting in

$$\Delta_a = \frac{1}{n\cdot V[x]}\sum_{i=1}^{n}\Delta_i\cdot\bigl|\,\delta y_i - a\cdot\delta x_i\,\bigr|,$$

where we denoted $\delta y_i \stackrel{\text{def}}{=} y_i - a\cdot\tilde{x}_i - h$ and $\delta x_i \stackrel{\text{def}}{=} \tilde{x}_i - E[x]$.
Since $h = E[y] - a\cdot E[x]$, the contribution of the $i$-th interval to the half-width of $h$ is

$$\Delta_h^{(i)} = \Delta_i\cdot\left|\frac{a}{n} + E[x]\cdot\frac{\delta y_i - a\cdot\delta x_i}{n\cdot V[x]}\right|, \quad\text{so}\quad \Delta_h = \sum_{i=1}^{n}\Delta_h^{(i)}.$$
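A small sketch of the linearised estimate for Delta_a follows (ours, written directly from the formulas above; the interval data are made up for illustration):

# Sketch: linearised half-width Delta_a for a = C[x,y]/V[x]
# when each x_i is known only as an interval [lo_i, hi_i].
def delta_a_linearised(x_intervals, y):
    n = len(y)
    mid = [(lo + hi) / 2 for lo, hi in x_intervals]    # midpoints x~_i
    half = [(hi - lo) / 2 for lo, hi in x_intervals]   # half-widths Delta_i
    ex = sum(mid) / n
    ey = sum(y) / n
    vx = sum((m - ex) ** 2 for m in mid) / n
    cxy = sum((m - ex) * (yi - ey) for m, yi in zip(mid, y)) / n
    a = cxy / vx
    h = ey - a * ex
    # Delta_a = (1/(n*V[x])) * sum_i Delta_i * |dy_i - a*dx_i|
    total = sum(d * abs((yi - a * m - h) - a * (m - ex))
                for d, m, yi in zip(half, mid, y))
    return total / (n * vx)

x_iv = [(0.15, 0.25), (0.35, 0.45), (0.55, 0.65), (0.75, 0.85)]
y = [2.6, 3.2, 3.8, 4.4]
print(delta_a_linearised(x_iv, y))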
Prior estimation of the resulting accuracy.
The above formulas provide us with the accuracy
after the data has been processed. It is often desir-
able to have an estimate prior to measurements,
to make sure that we will get c and h with desired
accuracy.
The difference $\delta y_i$ is a measurement error, so it is normally distributed with 0 mean and standard deviation $\sigma_y$ corresponding to the accuracy of measuring $y_i$. The difference $\delta x_i$ is distributed with 0 mean and standard deviation $\sqrt{V[x]}$. For estimation purposes, it is reasonable to assume that the values $\delta x_i$ are also normally distributed. It is also reasonable to assume that the errors in $x_i$ and $y_i$ are uncorrelated, so the linear combination $\delta y_i - a\cdot\delta x_i$ is also normally distributed, with 0 mean and variance $\sigma_y^2 + a^2\cdot V[x]$. It is also reasonable to assume that all the values $\Delta_i$ are approximately the same: $\Delta_i \approx \Delta$.
For a normal distribution $x$ with 0 mean and standard deviation $\sigma$, the mean value of $|x|$ is equal to $\sigma\cdot\sqrt{2/\pi}$. Thus, the absolute value $|\delta y_i - a\cdot\delta x_i|$ of the above combination has a mean value $\sqrt{2/\pi}\cdot\sqrt{\sigma_y^2 + a^2\cdot V[x]}$. Hence, the expected value of $\Delta_a$ is equal to

$$\Delta_a \approx \frac{\Delta}{V[x]}\cdot\sqrt{\frac{2}{\pi}}\cdot\sqrt{\sigma_y^2 + a^2\cdot V[x]}.$$

Since measurements are usually more accurate than expert estimates, we have $\sigma_y^2 \ll a^2\cdot V[x]$, hence

$$\Delta_a \approx \sqrt{\frac{2}{\pi}}\cdot\frac{a\cdot\Delta}{\sqrt{V[x]}}.$$

Similar estimates can be given for $\Delta_h$.
Why not get exact estimates? Because in general, finding the exact range is NP-hard. Let us show that in general, finding the exact range for the ratio $C[x, y]/V[x]$ is an NP-hard problem; this proof was first presented in Kreinovich, Longpre, Starks, Xiang, Beck, Kandathi, Nayak, Ferson, and Hajagos (2007).
The proof is similar to the proof that computing the range for the variance is NP-hard (Ferson, Ginzburg, Kreinovich, Longpre, & Aviles, 2005): namely, we reduce a partition problem (known to be NP-hard) to our problem. In the partition problem, we are given $m$ positive integers $s_1, \ldots, s_m$, and we must check whether there exist values $\varepsilon_i \in \{-1, 1\}$ for which $\sum_{i=1}^{m}\varepsilon_i\cdot s_i = 0$. We will reduce this problem to the following problem: $n = m + 2$, $y_1 = \cdots = y_m = 0$, $y_{m+1} = 1$, $y_{m+2} = -1$, $\mathbf{x}_i = [-s_i, s_i]$ for $i \le m$, $x_{m+1} = 1$, and $x_{m+2} = -1$. In this case:
$E[y] = 0$, so

$$C[x, y] = \frac{1}{n}\sum_{i=1}^{n} x_i\cdot y_i - E[x]\cdot E[y] = \frac{2}{n} = \frac{2}{m+2}.$$

Therefore, $C[x, y]/V[x] \to \min$ if and only if $V[x] \to \max$.
Here,

$$V[x] = \frac{1}{m+2}\left(\sum_{i=1}^{m} x_i^2 + 2\right) - \left(\frac{1}{m+2}\sum_{i=1}^{m} x_i\right)^2.$$
Since $|x_i| \le s_i$, we always have

$$V[x] \le V_0 \stackrel{\text{def}}{=} \frac{1}{m+2}\left(\sum_{i=1}^{m} s_i^2 + 2\right),$$

and the only possibility to have $V[x] = V_0$ is when $x_i = \pm s_i$ for all $i$ and $\sum_i x_i = 0$. Thus, $V[x] = V_0$ if and only if the original partition problem has a solution.
Hence, the smallest possible value of $C[x, y]/V[x]$ is equal to

$$\frac{2}{\sum_{i} s_i^2 + 2}$$

if and only if the original instance of the partition problem has a solution.
The reduction is proven, so our problem is indeed NP-hard.
Comment 1. In this proof, we consider the case when the values $x_i$ can be negative and larger than 1, while in bioinformatics, $x_i$ is always between 0 and 1. However, we can easily modify this proof: First, we can shift all the values $x_i$ by the same constant to make them positive; a shift changes neither $C[x, y]$ nor $V[x]$. Second, to make the positive values $\le 1$, we can then re-scale the values $x_i$ ($x_i \to \lambda\cdot x_i$), thus multiplying $C[x, y]/V[x]$ by a known constant. As a result, we get new values

$$x_i' = \frac{1}{2}\left(1 + \frac{x_i}{K}\right), \quad\text{where}\quad K \stackrel{\text{def}}{=} \max_i s_i,$$

for which $x_i' \in [0, 1]$ and the problem of computing $C[x, y]/V[x]$ is still NP-hard.
Comment 2. Since we cannot compute the exact range, what can we do to compute more accurate estimates for the range than those provided by linear approximation? One possibility is to use known algorithms to find the ranges for C[x, y] and for V[x] (Kreinovich, Xiang, Starks, Longpre, Ceberio, Araiza, Beck, Kandathi, Nayak, Torres, & Hajagos, 2006; Kreinovich et al., 2007), and then use the division operation from interval arithmetic to get the interval that is guaranteed to contain C[x, y]/V[x].
ACKNOWLEDGMENT
This work was supported in part by NSF grants
HRD-0734825 and EAR-0225670, by Texas
Department of Transportation grant No. 0-5453,
and by the Japan Advanced Institute of Science
and Technology (JAIST) International Joint Re-
search Grant 2006-08. The authors are thankful
to the anonymous referees for their valuable
suggestions.
REFERENCES
Dantsin, E., Kreinovich, V., Wolpert, A., & Xiang,
G. (2006). Population variance under interval un-
certainty: A new algorithm. Reliable Computing,
12(4), 273-280.
Dubois, D., Fargier, H., & Fortin, J. (2005). The
empirical variance of a set of fuzzy intervals.
Proceedings of the 2005 IEEE International
Conference on Fuzzy Systems FUZZ-IEEE2005,
Reno, Nevada, May 22-25, pp. 885-890.
Fell, J. M. G. (1961). The structure of algebras of operator fields. Acta Mathematica, 101, 19-38.
Fell, J. M. G. (1962). A Hausdorff topology for the
closed sets of a locally compact non-Hausdorff
space. Proceedings of the American Mathematical
Society, 13, 472-476.
Ferson, S., Ginzburg, L., Kreinovich, V., Longpre,
L., & Aviles, M. (2005). Exact bounds on finite
populations of interval data. Reliable Computing,
11(3), 207-233.
Ferson, S., & Hajagos, J. (2007). Interval versions
of statistical techniques, with applications to en-
vironmental analysis, bioinformatics, and privacy
in statistical databases. Journal of Computational
and Applied Mathematics, 199(2), 418-423.
Gierz, G., Hofmann, K. H., Keimel, K., Lawson, J.
D., Mislove, M., & Scott, D. S. (1980). A compen-
dium of continuous lattices. Berlin, Heidelberg,
New York: Springer-Verlag.
Goutsias, J., Mahler, R. P. S., & Nguyen, H. T.
(Eds.) (1997). Random sets: Theory and applica-
tions. New York: Springer-Verlag.
Heitjan D. F., & Rubin, D. B. (1991). Ignorability
and coarse data. Ann. Stat., 19(4), 2244-2253.
Jaulin L., Kieffer, M., Didrit, O., & Walter,
E. (2001). Applied interval analysis. London:
Springer-Verlag.
Klir, G., & Yuan, B. (1995). Fuzzy sets and fuzzy
logic: Theory and applications. Upper Saddle
River, NJ: Prentice Hall.
Kreinovich V., Lakeyev, A., Rohn, J., & Kahl, P.
(1997). Computational complexity and feasibility
of data processing and interval computations.
Dordrecht: Kluwer.
Kreinovich, V., Longpre, L., Starks, S. A., Xiang,
G., Beck, J., Kandathi, R., Nayak, A., Ferson, S.,
& Hajagos, J. (2007). Interval versions of statistical
techniques, with applications to environmental
analysis, bioinformatics, and privacy in statistical
databases. Journal of Computational and Applied
Mathematics, 199(2), 418-423.
Kreinovich, V., Xiang, G., Starks, S. A., Longpre,
L., Ceberio, M., Araiza, R., Beck, J., Kandathi,
R., Nayak, A., Torres, R., & Hajagos, J. (2006).
Towards combining probabilistic and interval
uncertainty in engineering calculations: Algo-
rithms for computing statistics under interval
uncertainty, and their computational complexity.
Reliable Computing, 12(6), 471-501.
Li, S., Ogura, Y., & Kreinovich, V. (2002). Limit
theorems and applications of set valued and fuzzy
valued random variables. Dordrecht: Kluwer.
Matheron, G. (1975). Random sets and integral
geometry. J. Wiley.
Möller, B., & Beer, M. (2004). Fuzzy randomness:
Uncertainty in Civil Engineering and Computa-
tional Mechanics. Springer-Verlag.
Nguyen, H. T. (2006). An introduction to random
sets. Chapman and Hall/CRC Press.
Nguyen, H. T., & Kreinovich, V. (1996). Nested
intervals and sets: Concepts, relations to fuzzy
sets, and applications. In R.B. Kearfott& V. Kre-
inovich (Eds.). Applications of interval computa-
tions, pp. 245-290. Dordrecht: Kluwer.
Nguyen, H. T. & Tran, H. (2007). On a continu-
ous lattice approach to modeling of coarse data
in system analysis. Journal of Uncertain Systems,
1(1), 62-73.
Nguyen, H. T., & Walker, E. A. (2006). A first
course in fuzzy logic (3rd edition). Chapman and
Hall/CRC Press.
Nguyen, H. T., Wang, Y., & Wei, G. (in press).
On Choquet theorem for random upper semi-
continuous functions. International Journal of
Approximate Reasoning.
Rabinovich, S. (2005). Measurement errors and
uncertainties: Theory and practice. New York:
Springer-Verlag.
Rockafellar, R. T., & Wets, R. J.-B. (1984). Variational systems, an introduction. In Springer lecture notes in mathematics, Vol. 1091, pp. 1-54.
Wei, G., & Wang, Y. (in press). On metrization
of the hit-or-miss topology using Alexandroff
compactifcation. International Journal of Ap-
proximate Reasoning.
Xiang, G. (2007). Interval uncertainty in bioin-
formatics. Abstracts of the New Mexico Bioin-
formatics Symposium, Santa Fe, New Mexico, March 8-9, p. 25.
Chapter III
Pattern Discovery in Gene
Expression Data
Gráinne Kerr
Dublin City University, Ireland
Heather Ruskin
Dublin City University, Ireland
Martin Crane
Dublin City University, Ireland
ABSTRACT
Microarray technology¹ provides an opportunity to monitor mRNA levels of expression of thousands of genes simultaneously in a single experiment. The enormous amount of data produced by this high-throughput approach presents a challenge for data analysis: to extract meaningful patterns, to evaluate its quality, and to interpret the results. The most commonly used method of identifying such patterns is cluster analysis. Common and sufficient approaches to many data-mining problems, for example, Hierarchical, K-means, do not address well the properties of typical gene expression data and fail, in significant ways, to account for its profile. This chapter clarifies some of the issues and provides a framework to evaluate clustering in gene expression analysis. Methods are categorised explicitly in the context of application to data of this type, providing a basis for reverse engineering of gene regulation networks. Finally, areas for possible future development are highlighted.
INTRODUCTION
A fundamental factor of function in a living cell is
the abundance of proteins present at a molecular
level, that is, its proteome. The variation between
proteomes of different cells is often used to ex-
plain differences in phenotype and cell function.
Crucially, gene expression is the set of reactions
that controls the level of messenger RNA (mRNA)
in the transcriptome, which in turn maintains
the proteome of a given cell. The transcriptome
is never synthesized de novo; instead, it is main-
tained by gene expression replacing mRNAs that
have been degraded, with changes in composi-
tion brought about by switching different sets of
genes on and off. To understand the mechanisms
of cells, involved in a given biological process,
it is necessary to measure and compare gene
expression levels in different biological phases,
body tissues, clinical conditions, and organisms.
Information on the set of genes expressed, in
a particular biological process, can be used to
characterise unknown gene function, identify
targets for drug treatments, determine effects
of treatment on cell function, and understand
molecular mechanisms involved.
DNA microarray technology has advanced
rapidly over the past decade, although the concept
itself is not new (Friemert, Erfle, & Strauss, 1989; Gress, Hoheisel, Zehetner, & Lehrach, 1992). It
is now possible to measure the expression of an
entire genome simultaneously, (equivalent to the
collection and examination of data from thou-
sands of single gene experiments). Components
of the system technology can be divided into:
(1) Sample preparation, (2) Array generation
and sample analysis, and (3) Data handling and
interpretation. The focus of this chapter is on the
third of these.
Microarray technology utilises base-pairing hybridisation properties of nucleic acids, whereby one of the four base nucleotides (A, T, G, C) will bind with only one of the four base ribonucleotides (A, U, G, C; pairing: A-U, T-A, C-G, G-C). Thus, a unique sequence of DNA that characterises a gene will bind to a unique mRNA sequence. Synthesized DNA molecules, complementary to known mRNA, are attached to a solid surface, referred to as probes. These are used to measure the quantity of specific mRNA of interest that is present in a sample (the target). The molecules in the target are labelled, and a specialised scanner is used to measure the amount of hybridisation (intensity) of the target at each probe.
for a number of microarray experiments typically
carried out for targets derived under various
experimental conditions (Figure 1). Secondary
variables (covariates) that affect the relationship
between the dependent variable (experimental
condition) and independent variables of primary
interest (gene expression) include, for example,
age, disease, and geography among others, and
can also be measured.
Figure 1. mRNA is extracted from a transcriptome of interest, (derived from cells grown under precise
experimental conditions). Each mRNA sample is hybridised to a reference microarray. The gene intensity
values for each experiment are then recorded.
An initial cluster analysis step is applied to
gene expression data to search for meaningful
informative patterns and dependencies among
genes. These provide a basis for hypothesis testing; the basic assumption is that genes showing similar patterns of expression across experimental conditions may be involved in the same underlying cellular mechanism. For example, Alizadeh,
Eisen, Davis, Ma, Lossos, Rosenwald, Boldrick,
Sabet, Tran, Yu, Powell, Yang, Marti, Moore,
Hudson Jr, Lu, Lewis, Tibshirani, Sherlock, Chan,
Greiner, Weisenburger, Armitage, Warnke, Levy,
Wilson, Grever, Byrd, Botstein, Brown, and Staudt
(2000) used a hierarchical clustering technique,
applied to gene expression data derived from dif-
fuse large B-cell lymphomas (DLBCL), to identify
two molecularly distinct subtypes. These had gene expression patterns indicative of different stages of B-cell differentiation: germinal centre B-like DLBCL and activated B-like DLBCL. Findings suggested that patients with germinal centre B-like DLBCL had a significantly better overall survival rate than those with activated B-like DLBCL. This work indicated a significant methodology shift towards characterisation of cancers based on gene expression, rather than morphological, clinical and molecular variables.
BACKGROUND
The Gene Expression Dataset
Data are typically presented as a real-valued matrix, with rows representing the expression of a gene over a number of experiments, and columns representing the pattern of expression of all genes for a given microarray experiment. Each entry x_ij is the measured expression of gene i in experiment j (Figure 1). The following terms and notations are used throughout this chapter (a brief illustration in code follows these definitions):

A gene/gene profile x is a single data item (feature vector) used by the clustering algorithm. It consists of d measurements, x = (x_1, x_2, ..., x_d).

A condition y is a single microarray experiment corresponding to a single column in the gene expression matrix, y = (x_1, x_2, ..., x_n)^T, where n is the number of genes in the dataset.

The individual scalar components x_ij of each gene vector represent the measured expression of gene i under experimental condition j.
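The sketch below is purely illustrative (the numbers are made up) and shows this matrix convention in code: rows are gene profiles, columns are conditions.

import numpy as np

# Toy gene expression matrix: n = 4 genes (rows), d = 3 conditions (columns).
expr = np.array([
    [2.1, 0.3, 1.7],   # gene 1
    [1.9, 0.4, 1.6],   # gene 2
    [0.2, 2.5, 0.1],   # gene 3
    [0.3, 2.4, 0.2],   # gene 4
])

gene_profile = expr[0, :]   # x = (x_1, ..., x_d): expression of gene 1 across conditions
condition = expr[:, 1]      # y = (x_1, ..., x_n)^T: all genes under condition 2
x_ij = expr[2, 0]           # expression of gene 3 under condition 1

print(gene_profile, condition, x_ij)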
Table 1. Selection of publicly available dataset repositories

- ArrayExpress: Gene expression and hybridisation array data repository. http://www.ebi.ac.uk/arrayexpress/#ae-main[0]
- CellMiner: Data from 60 cancer cell lines based on Affymetrix and cDNA microarray data. http://discover.nci.nih.gov/cellminer
- ExpressDB: Collection of E. coli and yeast RNA expression datasets. http://arep.med.harvard.edu/ExpressDB/
- GEO: Gene expression and hybridisation array data repository. http://www.ncbi.nlm.gov/geo/
- RAD: Gene expression and hybridisation array data repository. http://www.cbil.upenn.edu/RAD/
- SMD: Extensive collection of microarray data. http://genome-www.stanford.edu/microarray
There are a number of publicly available dataset repositories, which contain a wealth of microarray datasets²: Table 1 provides a sample of these. Typically, these repositories store data using the minimum information about a microarray experiment (MIAME) standard (Brazma, Hingamp,
Quackenbush, Sherlock, Spellman, Stoeckert,
Aach, Ansorge, Ball, Causton, Gaasterland,
Glenisson, Holstege, Kim, Markowitz, Matese,
Parkinson, Robinson, Sarkans, Schulze-Kremer,
Stewart, Taylor, Vilo, & Vingron, 2001), which
allow researchers to replicate the experiments.
This allows analysts to compare gene expression
data from different laboratories effectively, based
on information about the microarrays used in
experiments, how these were produced, samples
obtained and mRNA extracted and labelled. Ad-
ditional information is also recorded on methods
used to hybridise the sample, scan the image and
normalise the data.
Characteristics of the Gene
Expression Dataset
Choice of the appropriate clustering technique
relies on the amount of information on the par-
ticular properties of gene expression data available
to the analyst, and hence the likely underlying
structure. The following data characteristics are
typical of the gene expression dataset:
Measurement accuracy of mRNA expres-
sion levels depends on the experimental design
and rigour. While design of experiments is not
a specifc focus of this chapter, a good design
minimises variation and has a focused objective
(Kerr & Churchill, 2001). Technical variation
between microarray slides depends on numerous
factors including experimental technique, instru-
ment accuracy for detecting signals, and observer
bias. Biological variation may arise due to dif-
ferences in the internal states of a population of
cells, either from predictable processes, such as
cell cycle progression, or from random processes
such as partitioning of mitochondria during cell
division, variation due to subtle environmental
differences, or ongoing genetic mutation (Raser &
O'Shea, 2005). Pre-processing techniques attempt
to remove technical variation while maintaining
interesting biological variation.
Many variables, both random and fxed,
(biological and technical), are associated with
microarray measurements. Data is thus intrinsi-
cally noisy and outliers in the dataset need to be
identified and managed effectively. This usually takes one of two forms: (i) outlier accommodation, which uses a variety of statistical estimation or testing procedures that are robust against outliers; (ii) identification and decision on inclusion/exclusion,
used when outliers may contain key information
(Liu, Cheng, & Wu, 2002). Normalisation proce-
dures applied to gene expression data (Bolstad,
Irizarry, Astrand, & Speed, 2003), aim at minimis-
ing the effect of outliers (assuming these to be due
to experimental variation and thus undesirable).
Most manufacturers of microarrays, aware of
effects of optical noise and non-specifc binding,
include features in their arrays to measure these
directly: these measurements can be used in the
normalisation procedures. Note: although pre-
processing methods attempt to remove all noise
these may be only partially successful.
Missing values are common to microarray
data, and can be caused by insufficient resolu-
tion in image analysis, image corruption, dust
or scratches on the slide, robotic method used to
create the slide, and so on, (Troyanskaya, Cantor,
Sherlock, Brown, Hastie, Tibshirani, Botstein, &
Altman, 2001). In general, the number of missing
values increases with the number of genes being
measured. Many clustering algorithms, used for
gene expression data, require a complete matrix of
input values. Consequently, imputation or missing
data estimation techniques need to be considered
in advance of clustering. The effect of missing
data on pattern information can be minimised
through pre-processing.
Commonly, missing values in the gene ex-
pression matrix are replaced by zeroes or by an
average expression level of the gene, (or row
average). Such methods do not, however, take
into account the correlation structure of the data
and more sophisticated options include K-Nearest
Neighbour (KNN) and Support Vector Decom-
position type methods. Troyanskaya et al. (2001)
note that KNN and SVD-based methods are more
effective than traditional methods of replacement,
with KNN being more robust as the number of
missing values increases.
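As a rough illustration (not code from the studies cited), the difference between row-average replacement and a KNN-style estimate can be sketched with scikit-learn; KNNImputer here merely stands in for the weighted KNN scheme of Troyanskaya et al., and the toy matrix is ours:

import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (genes x conditions) with one missing entry (np.nan).
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, np.nan, 4.1],   # gene with a missing value
    [4.0, 3.0, 2.0, 1.0],
])

# (a) Replace by the gene's (row) average expression level.
row_means = np.nanmean(expr, axis=1)
row_filled = np.where(np.isnan(expr), row_means[:, None], expr)

# (b) KNN-style imputation: the missing entry is estimated from the
#     most similar gene profile (here, gene 1), preserving correlation.
knn_filled = KNNImputer(n_neighbors=1).fit_transform(expr)

print(row_filled[1], knn_filled[1])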
Clustering algorithms that permit overlap
(probabilistic or fuzzy clusters) are typically
more applicable to gene expression data since: (i)
the impact of noisy data on clusters obtained is a
fundamental consideration in algorithm choice.
(The assumption is that noisy genes are unlikely
to belong to any one cluster, but are equally likely
to be members of several clusters); (ii) the underlying principle of clustering gene expression data is that genes with similar change in expression for a
set of conditions are involved, together, in a simi-
lar biological function. Typically, gene products
(mRNA) are involved in several such biological
functions and groups need not be co-active under
all conditions. This gives rise to high variability
in the gene groups and/or some overlap between
them. For these reasons, constraining a gene to a
single cluster (hard clustering) is counter-intuitive
with respect to natural behaviour.
Additionally, methods that aim at a partial
clustering tend to be more suited to expression
data, with some genes or conditions not members
of any cluster (Madeira & Oliveira, 2000). Clus-
tering the microarray dataset can be viewed in
two ways: (i) genes can form a group which show
similar expression across conditions, (ii) condi-
tions can form a group which show similar gene
expression across all genes. It is this interplay of
conditions and genes that gives rise to bi-clusters,
whereby conditions and genes are simultaneously
grouped. Such partial clusterings, (or bi-clusters),
are defined over a subset of conditions and a subset of genes, thus capturing local structure in the dataset. Clearly, this allows: (i) noisy genes to be left out, with correspondingly less impact on the final outcome, (ii) genes belonging to no cluster, omitting a large number of irrelevant contributions, (iii) genes not belonging to well-defined groups. (Microarrays measure expression for the entire genome in one experiment, but genes may change expression independent of the experimental condition [e.g. due to stage in cell cycle]. Forced inclusion of such genes in well-defined but inappropriate groups may impact the final structures found for the data.)
Methods of Identifying Groups of
Related Genes
Cluster definition is dependent on clearly defined metrics, which must be chosen to reflect the data basis. Metric categories include:
Similarity-Based
The cluster is defined to be the set of objects in which each object is closer (or more similar) to a prototype that defines that cluster, as opposed to any other cluster prototype. A typical gene expression cluster prototype is often the average or centroid of all gene vectors in the cluster. The similarity metric used affects the cluster produced. Common measures include: (i) Euclidean distance, (ii) Manhattan distance, and (iii) Squared Pearson correlation distance (Quackenbush, 2001), with the last being the most popular as it captures gene expression "shape" without regard to the magnitude of the measurements. However, this distance measurement is quite sensitive to outliers, although correlation, rather than distance, is inherently more important for gene expression data. Take, for example, two gene vectors X_1 = (1, 2, 3, 4, 5) and X_2 = (3, 6, 9, 12, 15). These two profiles result in a Euclidean distance of 14.8323 and a Manhattan distance of 30. The Pearson correlation distance, however, is 0, reflecting the fact that the two genes are showing the same patterns of expression.
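These numbers can be checked with a few lines of code (a sketch of ours; the Pearson distance is written out as 1 minus the correlation coefficient):

import numpy as np

x1 = np.array([1, 2, 3, 4, 5])
x2 = np.array([3, 6, 9, 12, 15])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))   # about 14.83 (sqrt of 220)
manhattan = np.sum(np.abs(x1 - x2))           # 30
r = np.corrcoef(x1, x2)[0, 1]                 # Pearson correlation = 1.0
pearson_distance = 1 - r                      # 0: the two profiles have the same "shape"

print(euclidean, manhattan, pearson_distance)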
Density-Based
Clusters, in this instance, are based on dense re-
gions of genes, surrounded by less-dense regions.
Such methods are often employed when clusters
are irregular or intertwined, and when noise and
outliers are present (Sander, Ester, Kriegel, &
Xu, 1998). However, as each cluster is assumed
to have a uniform density, the method is not read-
ily applicable to gene expression data, as some
biological functions involve more gene products
than others. The high dimensionality also means
that density thresholds can be difficult to define
and expensive to compute.
Model-Based
Despite the convenience of similarity-based measures, it can be biologically meaningless to characterise a cluster through a cluster prototype, such as the mean or centroid, as these may be poorly representative of the cluster elements as a whole. As a typical gene expression dataset is large, noisy distortion of these prototypes may be considerable, resulting in relatively uninformative structures. In contrast, model-based techniques, applied to expression space, consider the fit of genes in a given cluster to the "ideal" cluster. Concentrating on the strengths of the bi-clustering approach, and following notation from Madeira and Oliveira (2004), four types of model can be identified:
(i) Bi-clusters with constant values. A perfect cluster is a sub-matrix (I, J) of the gene expression matrix (N, D), with all values equal, $x_{i,j} = \mu$. The ideal bi-cluster is, of course, rarely found in noisy gene expression data.
(ii) Bi-clusters with constant values on rows or columns. A subset of the "ideal" or constant bi-cluster model, and one which is more realistic for gene expression data, is a sub-matrix with constant rows or columns. For the former, rows have constant value in a sub-matrix (I, J) given by $a_{ij} = \mu + \alpha_i$ or $a_{ij} = \mu\cdot\alpha_i$, where $\mu$ is the typical bi-cluster value and $\alpha_i$ is the row offset for $i \in I$. Similarly, perfect bi-clusters with constant columns can be obtained for $a_{ij} = \mu + \beta_j$ or $a_{ij} = \mu\cdot\beta_j$, where $j \in J$.
(iii) Bi-clusters with coherent values. From (ii), a combined additive model can be derived. In this framework, a bi-cluster is a sub-matrix (I, J), with coherent³ values, based on the model:

$$a_{ij} = \mu + \alpha_i + \beta_j \qquad \text{Eq. 1}$$

(where $\mu$, $\alpha_i$ and $\beta_j$ are as for (ii)). Similarly, the multiplicative model assumes that a perfect bi-cluster could be identified using $a_{ij} = \mu'\cdot\alpha_i'\cdot\beta_j'$. Note: the additive form clearly follows for $\mu = \log(\mu')$, $\alpha_i = \log(\alpha_i')$ and $\beta_j = \log(\beta_j')$.
The artificial example in Figure 2(a) and (b) illustrates this point. The sub-matrix is an ideal bi-cluster found in a fictional dataset, where $\mu = 1$, the offsets for rows 1 to 3 are $\alpha_1 = 0$, $\alpha_2 = 2$, $\alpha_3 = 4$ respectively, and the offsets for columns 1 to 6 are $\beta_1 = 0$, $\beta_2 = 1$, $\beta_3 = 2$, $\beta_4 = 4$, $\beta_5 = 1$, $\beta_6 = -1$ respectively. The expression levels can be obtained from Eq. 1. Of course, when searching the dataset for a fit to this model, the mean and offset parameters are unknown and must be estimated from the data. The schematic illustrates the coherent expression profile over the six conditions. Similarly for the multiplicative model, where $\mu = 1$, $\alpha_1 = 1$, $\alpha_2 = 2$, $\alpha_3 = 5$, $\beta_1 = 1$, $\beta_2 = 2$, $\beta_3 = 4$, $\beta_4 = 6$, $\beta_5 = 3$, $\beta_6 = 1.5$.
In reality, these perfect bi-clusters are, of course, unlikely to occur, so each entry in the sub-matrix can be regarded as having a residue component $r_{ij}$ (Cheng & Church, 2000):

$$a_{ij} = \mu + \alpha_i + \beta_j + r_{ij}. \qquad \text{Eq. 2}$$
Thus, finding bi-clusters is equivalent to finding sub-matrices that minimise the average residue.
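To make the additive model concrete, the following sketch (ours) builds the perfect bi-cluster of the Figure 2(a) example from Eq. 1; every pair of rows differs by a constant, so the profiles are "parallel" across the six conditions.

import numpy as np

# Perfect additive bi-cluster (Eq. 1): a_ij = mu + alpha_i + beta_j,
# using the offsets quoted for the Figure 2(a) example.
mu = 1.0
alpha = np.array([0.0, 2.0, 4.0])                  # row offsets
beta = np.array([0.0, 1.0, 2.0, 4.0, 1.0, -1.0])   # column offsets

A = mu + alpha[:, None] + beta[None, :]
print(A)
print(A[1] - A[0], A[2] - A[0])   # constant differences: [2 ...], [4 ...]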
(iv) Bi-clusters with coherent evolution. Local
structures, with coherent evolution across a
sub-matrix (I,J), can exist in the data regard-
less of the exact values. This occurs if there
is a pattern of co-regulation for a subset of
genes and conditions. Expression can occur
at different levels, so for example if two genes
are up-regulated by different degrees, (e.g.
due to a specifc condition), these are said
to experience coherent evolution.
Taking Figure 2(c) as an example. Gene 1 and
gene 2 are regulated, with similar periodicity,
while gene 3 shows alternated periodicity. Al-
though the genes are expressed at different levels,
each change in expression level is triggered by the
same condition. In a simple form, each gene can
be said to be exhibiting three states, down-regu-
lated, up-regulated or no change. Additional states
can be used, for example strongly up-regulated,
weakly up-regulated etc. depending on the detail
of the model required. Adding additional states,
of course, adds complexity to the model, and cut-
off points between states of regulation must be
considered carefully. The problem then reduces to finding profiles that show consistent patterns of regulation across all conditions.
CLUSTER ANALYSIS
Current Methods
With extensive choice of metric, structure, completeness, etc. in cluster analysis, it is useful to consider a framework (Table 2) for performance comparison. The taxonomy used is due to Jain, Murty, and Flynn (1999).
Hierarchical Methods
Ever since the landmark paper of Eisen et al.
(1998), numerous clustering algorithms have been
applied to gene expression data. Predominantly
these have been hierarchical methods, (Higgins,
Shinghal, Gill, Reese, Terris, Cohen, Fero, Pollack,
van de Rijn, & Brooks, 2003; Khodursky, Peter,
Cozzarelli, Botstein, Brown, & Yanofsky, 2000;
Makretsov, Huntsman, Nielsen, Yorida, Peacock,
Cheang, Dunn, Hayes, van de Rijn, Bajdik, &
Gilks, 2004; Wen, Fuhrman, Michaels, Carr,
Smith, Barker, & Somogyi, 1998), due mainly to
ease of implementation, visualisation capability
and general availability.
Figure 2. Models in gene expression datasets. The matrix gives clusters found, where rows are gene expression values across 6 experimental conditions (columns). X-axis indicates experimental condition or time point, y-axis indicates gene expression level. Model forms are (a) Additive for rows and columns, (b) Multiplicative for rows and columns and (c) Coherent evolution.

The basic steps of a hierarchical clustering algorithm include: (i) computation of the proximity matrix of distances between each gene (initially each is in a unique cluster of size one), (ii) searching the proximity matrix for the two closest clusters, (iii) merging these two clusters and updating the proximity matrix, and (iv) repeating steps two and three until all genes are in one cluster.
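A minimal sketch of these steps using scipy's agglomerative implementation (illustrative only; the toy matrix, linkage choice and cut threshold are ours):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: 4 gene profiles over 3 conditions.
expr = np.array([
    [2.1, 0.3, 1.7],
    [1.9, 0.4, 1.6],
    [0.2, 2.5, 0.1],
    [0.3, 2.4, 0.2],
])

# Steps (i)-(iv): pairwise distances, then repeated merging of the two
# closest clusters; average linkage on Euclidean distances here.
Z = linkage(expr, method="average", metric="euclidean")

# Cutting the dendrogram at a chosen dissimilarity gives flat clusters.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)   # e.g. genes 1-2 and genes 3-4 group together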
Such agglomerative clustering techniques
vary with respect to the (i) distance metric used
and the decision on cluster merger (that is linkage
choice as single, complete, average or centroid;
see Quackenbush [2001]). Typically, the output of a hierarchical clustering algorithm is a dendrogram, representing nested patterns in the data and the similarity level at which clusters are merged. The
choice of parameters affects both structure of, and
relationship between the clusters. Hierarchical
cluster structure works well for situations where
membership is crisp, but, despite their popularity
these methods may not be appropriate to capture
natural structures in gene expression data.
Nevertheless, some successes of clustering
conditions based on gene expression have been
reported. For example, Makretsov et al. (2004) used gene expression profiles to determine whether sub-types of invasive breast cancer could be identified, with a view to improving patient prognosis. Hierarchical clustering successfully identified three cluster groups with significant differences in clinical outcome. Similarly, a study on renal cell carcinoma, Higgins et al. (2003), found that hierarchical clustering led to segregation of "histologically distinct tumour types solely based on their gene expression patterns" (p. 925). These studies indicate that characterisation of tumours is potentially viable from gene expression profiling.
Hierarchical clustering algorithm properties include location of complete clusters, forced membership and large time-space complexity, but inclusion of noisy genes in the cluster can affect the final grouping (depending to a greater or lesser extent on the linkage method and the distance measure used). As algorithms are prototype-based, further iterations exacerbate noise effects. Given the distance metric basis, hierarchical techniques also tend to produce globular structures.

Table 2. Popular clustering techniques applied to gene expression data. Partial (overlapping) clusters are more relevant in this context. For each technique the columns give gene membership, cluster structure, cluster type, and whether the clustering is complete or partial.

- Hierarchical (Eisen, Spellman, Brown, & Botstein, 1998): hard membership; hierarchical (nested) structure; similarity-based; complete.
- K-Means (Tavazoie, Hughes, Campbell, Cho, & Church, 1999): hard membership; no structure; similarity-based; complete.
- FCM (Gasch & Eisen, 1999): fuzzy membership; no structure; similarity-based; complete.
- SOM (Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Loh, Downing, Caligiuri, Bloomfield, & Lander, 1999): hard membership; topological structure; similarity and neighbourhood kernel function-based; complete.
- Delta clusters (Cheng & Church, 2000): shared membership; overlap; based on coherent additive model; partial.
- FLOC (Yang, Wang, Wang, & Yu, 2003): shared membership; overlap; based on coherent additive model; partial.
- SAMBA (Tanay, Sharan, & Shamir, 2002): shared membership; overlap; based on coherent evolution model; partial.
Partitive Methods
In contrast to hierarchical algorithms, which create
clusters in a bottom up fashion resulting in nested
levels of clustering, partitive methods optimise a
function of given criteria, partitioning the entire
dataset and obtaining one cluster structure.
Partitive K-Means clustering (MacQueen, 1967) produces hard clustering with no structural relationship between the individual clusters. The main steps of the K-means algorithm are: (i) identification of K prototype vectors for K clusters in the dataset, (ii) assignment of each gene to a cluster based on its similarity to the cluster prototype, (iii) computation of cluster prototypes based on the current genes in the cluster, and (iv) repeating steps two and three until convergence criteria are satisfied. These may be, for example, no (or minimal) reassignment of genes to new clusters, or minimal improvement in optimisation of the criteria function. A typical optimisation approach is to minimise the squared error within a cluster:

$$C = \sum_{j=1}^{k}\sum_{i=1}^{n} y_{ij}\, d(x_i, q_j) \qquad \text{Eq. 3}$$

where $q_j$ is the vector representing the mean of the cluster, $x_i$ is the vector representing the gene, $d(x_i, q_j)$ is a distance measure and $y_{ij}$ is a partition element. Here $y_{ij} \in \{0, 1\}$, and $y_{ij} = 1$ indicates that gene i is assigned to cluster j.
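A compact sketch of this procedure (ours; scikit-learn's KMeans carries out steps (i)-(iv) with random prototype initialisation and minimises a within-cluster squared-error criterion in the spirit of Eq. 3):

import numpy as np
from sklearn.cluster import KMeans

expr = np.array([
    [2.1, 0.3, 1.7],
    [1.9, 0.4, 1.6],
    [0.2, 2.5, 0.1],
    [0.3, 2.4, 0.2],
])

# K = 2 clusters; n_init restarts address the sensitivity to the
# initial prototype vectors discussed below.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(expr)

print(km.labels_)    # hard assignment of each gene to exactly one cluster
print(km.inertia_)   # within-cluster sum of squared distances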
An example of use of the K-means method is discussed in Tavazoie et al. (1999), and is based on a yeast time-course gene expression dataset containing profiles for more than 6000 genes, with 15 time points (at 10-minute intervals, over nearly two cell cycles) (Cho, Campbell, Winzeler, Steinmetz, Conway, Wodicka, Wolfsberg, Gabrielian, Landsman, Lockhart, & Davis, 1998). (This work succeeded in identifying transcriptionally co-regulated genes in yeast.) Unfortunately, initial prototype vectors in K-Means usually have a large impact on the data structures found. Prototype vectors are often genes selected at random from the dataset. Alternatively, Principal Component Analysis can be used to project the data to a lower dimensional sub-space and K-means is then applied to the subspace (Zha, Ding, Gu, He, & Simon, 2002). Whichever method is used in practice to select prototype vectors, it is usually the case that different initial prototypes are investigated to assess stability of the results, with the best configuration (according to the optimisation criteria) used as output clusters.
Fuzzy Methods
As observed (Section "Characteristics of the Gene Expression Dataset"), multiple cluster membership is more appropriate for gene expression data. The Fuzzy C-Means (FCM) algorithm extends the standard K-means algorithm to the case where each gene has a membership degree indicating its "fuzzy" or percentage association with the centroid of a given cluster. Typically, each gene has a total membership value of 1, which is divided proportionally between clusters according to its similarity with the cluster means. A fuzzy partition matrix Y (of dimension N×K, where K is the number of clusters and N is the number of genes) is created, where each element $y_{ij}$ is the membership grade of gene $i$ in cluster $j$, and a weighted version of Eq. 3 applies. At each iteration, the membership value $y_{ij}$ and the cluster centre $k_j$ are updated by:
$$y_{ij} = \left[\sum_{c=1}^{K}\left(\frac{d(x_i, k_j)}{d(x_i, k_c)}\right)^{\frac{2}{m-1}}\right]^{-1} \qquad \text{Eq. 4}$$

$$k_j = \frac{\displaystyle\sum_{i=1}^{N} y_{ij}^{m}\, x_i}{\displaystyle\sum_{i=1}^{N} y_{ij}^{m}} \qquad \text{Eq. 5}$$

where $m > 1$ denotes the degree of fuzziness (everything else is as for Eq. 3). The iterations stop when $\max_{ij}\,|y_{ij}^{k+1} - y_{ij}^{k}| < \varepsilon$, where $\varepsilon$ is a termination criterion with value between 0 and 1, and $k$ is the number of iterations.
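A bare-bones sketch of these two update steps follows (ours, for illustration only; the small epsilon guard and the toy data are assumptions, and a production implementation would also test the stopping criterion above):

import numpy as np

def fcm_step(X, centres, m=2.0):
    """One iteration of fuzzy C-means: Eq. 4 (memberships), then Eq. 5 (centres)."""
    # d[i, j]: distance of gene i to cluster centre j (epsilon avoids division by zero)
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
    # Eq. 4: y_ij = 1 / sum_c (d_ij / d_ic)^(2/(m-1))
    ratio = d[:, :, None] / d[:, None, :]
    y = 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)
    # Eq. 5: k_j = sum_i y_ij^m x_i / sum_i y_ij^m
    w = y ** m
    new_centres = (w.T @ X) / w.sum(axis=0)[:, None]
    return y, new_centres

X = np.array([[2.1, 0.3], [1.9, 0.4], [0.2, 2.5], [0.3, 2.4]])
centres = X[[0, 2]]                      # crude initialisation
for _ in range(20):
    y, centres = fcm_step(X, centres)
print(np.round(y, 3))                    # soft memberships summing to 1 per gene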
Given the usual constraint that membership
values of a gene must sum to unity, these values
should be interpreted with care. A large mem-
bership value does not indicate strength of
expression but rather reduced co-membership
across several clusters (Krishnapuram & Keller,
1993). Table 3 illustrates this idea for three clusters.
FCM was carried out on published yeast genomic
expression data (Gasch & Eisen, 2002; results
available at http://rana.lbl.gov/FuzzyK/data.html).
The membership values for gene B and gene D
are very different for cluster 21, although they
are approximately equidistant from the centroid
of the cluster. Similarly, gene C and gene D have
comparable membership values for cluster 4.
However, gene C is more typical than gene D.
With similar centroid distance measures, member-
ship value for gene B in cluster 21 is smaller than
membership value of gene A in cluster 46. These
values arise from the constraint that membership
values must sum to unity across all clusters, forcing
a gene to give up some of its membership in one
cluster to increase it in another. Listing the genes
of a cluster, based on membership values alone
is somewhat non-intuitive as it is not a measure
of their compatibility with the cluster. However, interpretation of the list in terms of degree of sharing between clusters is of value.
The work of Gasch and Eisen (2002) on the
use of FCM in analysing microarray data looked
at clustered responses of yeast genes to environ-
mental changes. Groups of known functionally co-
regulated genes, and novel groups of co-regulated
genes, were found by this method, although missed
by both hierarchical and K-means methods.
Artificial Neural Networks
Artificial neural networks (ANN) mimic the idea
of biological neural networks, where links between
various neurons (nodes) can be strengthened or
weakened through learning. A number of ANN
types have been explored, with Self-Organising
Maps (SOM) (Kohonen, 1990) proving popular
for the analysis of gene expression, as these pro-
vide a fast method of visualising and interpreting
high dimensional data. The network maps the
high-dimension input gene vector into a lower
dimensional space. A SOM is formed by an input
layer of D nodes, (where D is the gene vector di-
mension), and an output layer of neurons arranged
in a regular grid (usually of 1 or 2 dimensions).
A vector, of the same dimension as the input
gene, references each node in the output layer.
Table 3. Difficulties interpreting membership values for FCM. GENE1649X (Gene A), GENE6076X (Gene B), GENE5290X (Gene C) and GENE2382X (Gene D). For clusters 4, 21 and 46, the table lists the distance to the cluster centroid (Euclidean) and the associated membership value of the gene.

- GENE1649X (A): cluster 4: dist 10.691, mem 0.002575; cluster 21: dist 8.476, mem 0.002002; cluster 46: dist 3.864, mem 0.482479
- GENE6076X (B): cluster 4: dist 6.723, mem 0.009766; cluster 21: dist 3.855, mem 0.009341; cluster 46: dist 6.33, mem 0.007381
- GENE5290X (C): cluster 4: dist 6.719, mem 0.007653; cluster 21: dist 5.29, mem 0.00515; cluster 46: dist 8.024, mem 0.005724
- GENE2382X (D): cluster 4: dist 7.725, mem 0.007609; cluster 21: dist 3.869, mem 0.01782; cluster 46: dist 6.279, mem 0.010249

Briefly, the mechanism involves: (i) initialisation of the prototype vectors of the output nodes, (ii) training the network to find clusters in the data (genes are selected at random from the dataset and the closest output neuron is identified by its prototype vector. Once an output neuron is identified, its topological neighbours are updated to reflect this. Training continues until the reference vectors satisfy a stopping criterion), and (iii) sequential application of all gene vectors to the SOM, where only one output neuron fires upon receiving an input gene vector. Members of a cluster, represented by output neuron i, are the set of genes, applied to the input neurons, causing output neuron i to fire.
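A toy sketch of this training loop for a one-dimensional output grid (ours; real SOM implementations use a proper neighbourhood kernel and decaying parameters, as discussed below):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[2.1, 0.3], [1.9, 0.4], [0.2, 2.5], [0.3, 2.4]])   # gene vectors

n_nodes = 3                                 # 1-D output layer of 3 neurons
proto = rng.random((n_nodes, X.shape[1]))   # (i) initialise prototype vectors

lr = 0.5
for epoch in range(50):                     # (ii) training
    for x in rng.permutation(X):
        winner = np.argmin(np.linalg.norm(proto - x, axis=1))
        # update the winner and its grid neighbours towards the input
        for j in range(max(0, winner - 1), min(n_nodes, winner + 2)):
            proto[j] += lr * (x - proto[j])
    lr *= 0.95                              # shrink the learning rate

# (iii) map each gene to the output neuron that "fires" for it
clusters = [int(np.argmin(np.linalg.norm(proto - x, axis=1))) for x in X]
print(clusters)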
ANN techniques have been used in a number
of gene expression studies, including Tamayo,
Slonim, Mesirov, Zhu, Kitareewan, Dmitrovsky, Lander, and Golub (1999) (to analyse haemato-
poietic differentiation); Toronen, Kolehmainen,
Wong, & Castren (1999) (to analyse yeast gene
expression data); and Golub et al. (1999) (to cluster
acute lymphoblastic leukaemia [ALL] and acute
myeloid leukaemia [AML]). The stopping crite-
rion of the SOM is crucial, since over-fitting to the
training dataset is a risk. A further disadvantage
is the amount of prior information needed. SOM
requires input parameters such as learning rate,
neighbourhood size and kernel function, as well
as topology of the map, (typically hexagonal or
square). The stability of results is also an issue,
as a particular gene vector can be found to cause
different output nodes to fire at different iterations (Jain et al., 1999). Furthermore, clusters produced
by the SOM are sensitive to choice of initial vec-
tors for the output neurons, and a sub-optimal
structure may result from a poor selection.
Search Based Methods
While methods considered so far focus on finding global structures in the data, local structures are frequently of great interest. Cheng and Church (2000) adapted work of Hartigan (1972) for gene expression data, producing simultaneous clusters of genes and conditions and an overall partial clustering of the data. From Eq. 1, each value $a_{ij}$ of a sub-matrix can be defined from the typical value within the bi-cluster, $a_{IJ}$, plus the offsets for the row mean, $a_{iJ} - a_{IJ}$, and the column mean, $a_{Ij} - a_{IJ}$. Thus, each value in the sub-matrix should (ideally) be:
$$a_{ij} = a_{iJ} + a_{Ij} - a_{IJ} \qquad \text{Eq. 6}$$
The Cheng and Church technique defines the Mean Residue Score, H, of a sub-matrix, based on Eq. 2 and Eq. 6, such that:

$$H(I, J) = \frac{1}{|I|\,|J|}\sum_{i\in I,\ j\in J} r_{ij}^{2}, \qquad \text{Eq. 7}$$

where $r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$.
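A direct transcription of Eq. 7 in code (a sketch of ours; the sub-matrix below is the Figure 2(a) bi-cluster with a little added noise, purely for illustration):

import numpy as np

def mean_residue_score(A):
    """H(I, J) of Eq. 7 for a sub-matrix A, with r_ij = a_ij - a_iJ - a_Ij + a_IJ."""
    row_means = A.mean(axis=1, keepdims=True)   # a_iJ
    col_means = A.mean(axis=0, keepdims=True)   # a_Ij
    overall = A.mean()                          # a_IJ
    r = A - row_means - col_means + overall
    return np.mean(r ** 2)

rng = np.random.default_rng(1)
perfect = 1.0 + np.array([0.0, 2.0, 4.0])[:, None] + np.array([0.0, 1.0, 2.0, 4.0, 1.0, -1.0])
noisy = perfect + rng.normal(scale=0.1, size=perfect.shape)

print(mean_residue_score(perfect))   # ~0 for an ideal additive bi-cluster
print(mean_residue_score(noisy))     # small positive score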
The algorithm carries out greedy iterative
searches for sub-matrices (I,J), which minimise
this function (Eq. 7), generating a large time cost,
as each row and column of the dataset must be
tested for deletion. (A sub-matrix is considered
a bi-cluster if its Mean Residue Score falls below
a user specifed threshold). A further overhead
is the masking of a bi-cluster with random
numbers once it is found, to prevent finding the
same clusters repeatedly on successive iterations.
There is, nevertheless, high probability that this
replacement with random numbers affects the
discovery of further bi-clusters. To overcome this
random interference Yang et al. (2003) devel-
oped flexible overlapped bi-clustering (FLOC),
generalising the model of Cheng and Church to
incorporate null values.
For both the Cheng and Church (2000) al-
gorithm and the FLOC generalisation, K can
be specified to be much larger than the desired
number of groups, without affecting the outcome,
as each row and column in the expression matrix
can belong to more than one bi-cluster, (Cheng
& Church, 2000; Yang et al., 2003). Selecting K
then reduces to selecting the percentage of the
bi-clusters with the best Mean Residue-score (Eq.
7). The cost is increased computation time; the Cheng and Church algorithm finds one bi-cluster
at a time while FLOC finds all simultaneously.
However, a major additional strength of the bi-
clustering techniques is the minimal requirement
for domain knowledge. Also, as FLOC accounts
for null values in the dataset, the preliminary
imputation of missing values is not necessary.
Graph theoretic Methods:
A further approach, which is proving useful in the
analysis of large complex biological datasets, is
that of graph theory, (Aiello, Chun, & Lu, 2000;
Aittokallio & Schwitowski, 2006; Guillaume &
Latapy, 2006; Maslov, Sneppen, & Zaliznyak,
2004). A given gene expression dataset can be viewed as a weighted bipartite graph, G = (V, U, E), where V is the set of gene vertices, U is the set of condition vertices, with V ∩ U = ∅, and E is the set of edges, with (u, v) ∈ E having a weight a_uv proportional to the strength of the expression of gene v under condition u (Figure 3).

Figure 3. Bipartite graph representing expression for seven genes under five conditions; edges indicate a change in expression.
Analysis of the models involved focuses on identification of similarly or densely connected sub-graphs of nodes and, of course, relies greatly on the method used to define the edge weights. Clustering by graph network leads to similar issues as before: (i) results are highly sensitive to data quality and input parameters, (ii) predicted clusters can vary from one method of graph clustering to another. Clusters that share nodes and edges of a graph network clearly overlap and, as noted in the section titled "Characteristics of the Gene Expression Dataset", are desirable for gene expression interpretation.
The Statistical Algorithmic Method for Bi-cluster Analysis (SAMBA) (Tanay et al., 2002) uses a graphical approach and, unlike previous bi-clustering techniques, finds coherent evolution in the data. An edge is defined to exist between a gene node u and a condition node v if there is significant change in expression of gene u under condition v, relative to the gene's normal level (a non-edge exists if u does not change expression under v). Each edge and non-edge is then weighted, based on a log-likelihood model, with weights:

$$\log\frac{p_c}{P_{(u,v)}} > 0 \ \text{ for edges, and } \ \log\frac{1 - p_c}{1 - P_{(u,v)}} < 0 \ \text{ for non-edges.} \qquad \text{Eq. 8}$$

Here, $P_{(u,v)}$ is the fraction of random bipartite graphs, with degree sequence identical to G, that contain edge (u, v), and $p_c$ is a constant probability assigned by the user. For $p_c > \max_{(u,v)\in U\times V} P_{(u,v)}$, edges are taken to occur in a bi-cluster with equal probability. Weights as determined by Eq. 8 are assigned to the edges and non-edges in the graph. A major strength of this method is that the statistical significance of any sub-graph is then simply determined by its weight.
Cluster Evaluation and Comparison
Evaluation is not particularly well developed for
clustering analyses applied to gene expression
data, as very little may be known about the da-
taset beforehand. Many clustering algorithms are
designed to be exploratory, producing different clusters according to given classification criteria, and will discover a structure, meaningful in that context, which may yet fail to be optimal or even biologically realistic. For example, for K-Means the "best" structure is one that minimises the sum of squared errors (MacQueen, 1967) while, for the Cheng and Church algorithm (Cheng & Church, 2000), it is that which minimises the Mean Residue Score (Eq. 7). The two may not
be directly comparable, as the former highlights
global patterns in the data and the latter local
patterns. While larger deviations from the mean
may also correspond to large residue scores this
will not always be the case. For example, Figure
4 highlights a simple situation with three genes
in a cluster. According to the K-means criterion,
the within cluster distance is approximately 11.02,
based on Euclidean distance and centroid of the
cluster. The Mean Residue Score is 0. Reducing the scale of profile 1 by one third (Figure 4(b)) decreases the within-cluster distance to 7.91, while increasing the Mean Residue Score slightly to 0.0168. Both (a) and (b) are roughly equivalent. Consequently, interpretation of cluster results relies on some level of subjectivity as well as independent validation and integration of findings. Subjective evaluation, even for low dimensional data, is non-trivial at best, but becomes increasingly difficult for high dimensional gene expression data. Clearly, each technique will find patterns even if these are not meaningful in a biological context.

Figure 4. Gene expression profiles for two equivalent clusters; cluster in (B) has Profile 1 scaled down by one third.
The benefits of the individual techniques, as
applied to gene expression data, were highlighted
in the last section. This section aims at providing
navigational guidelines for some degree of objec-
tive evaluation and comparison.
Determining the Correct Number of
Clusters
Cluster number, in the absence of prior knowledge,
is determined by whether a non-random structure
exists in the data. Limiting to a specific number of groups will bias the search, as patterns tend to be well-defined only for the strongest signals. Commonly, statistical tests for spatial randomness test if a non-random structure exists, but identification of small, biologically meaningful clusters remains non-trivial. This is particularly true of standard methods, which find global structure, but lack fine-tuning to distinguish local structures. Selection of the "correct" number of clusters (K) thus is inherently iterative. Near optimal K should clearly minimise heterogeneity between groups, while maximising homogeneity within groups, but determining the number of significant clusters relies not only on direct extraction (or assessment) but also on appropriate hypothesis testing. Direct methods are based on a variety of criteria⁴. Nevertheless, improvement in terms of identification of local clusters is slight. Specific
tests include Gap Statistic (Tibshirani, Walther, &
Hastie, 2001), Weighted Discrepant Pairs (WADP)
(Bittner, Meltzer, Chen, Jiang, Seftor, Hendrix,
Radmacher, Simon, Yakhini, Ben-Dor, Sampas,
Dougherty, Wang, Marincola, Gooden, Lueders,
Glatfelter, Pollock, Carpten, Gillanders, Leja,
Dietrich, Beaudry, Berens, Alberts, & Sondak,
2000), and a variety of permutation methods
(Bittner et al., 2000; Fridlyand & Dudoit, 2001).
Since most involve bootstrapping, these methods
can be computationally very expensive. Compari-
son of methods for selecting the number of groups
is discussed by Milligan and Cooper (1985) and,
more recently, by Fridlyand and Dudoit (2001),
who note that no existing tests are optimal for
gene expression data.
Comparing Results from Clustering
Algorithms
Numerical measures of cluster goodness include
cluster cohesion (compactness or tightness),
that is how closely related genes in a cluster are,
while measures of cluster separation (isolation),
determine how distinct each cluster is. The Group
Homogeneity Function, is often used to measure
the association (distinctiveness) of genes within
and between groups.
Comparison with Metadata
Including biological function information in the
gene list for each cluster inevitably provides a
more complete picture of the dataset and of the
success of the technique. This information can
be used to validate the clusters produced, and a
number of functional annotation databases are
available. The Gene Ontology database (Ash-
burner, Ball, Blake, Botstein, Butler, Cherry,
Davies, Dolinski, Dwight, Epping, Harris, Hill,
Issel-Tarver, Kasarskis, Lewis, Matese, Rich-
ardson, Ringwald, Rubin, & Sherlock, 2000) for
example, provides a structured vocabulary that
describes the role of genes and proteins in all