$$C(\tilde{P}, \tilde{R}) = 1 - \frac{1}{2}\sum_{x \in X} \left( \mu_{\tilde{P}}(x) - \mu_{\tilde{R}}(x) + \left| \mu_{\tilde{P}}(x) - \mu_{\tilde{R}}(x) \right| \right),$$
which can be rewritten as
$$C(\tilde{P}, \tilde{R}) = 1 - \sum_{x \in X:\, \mu_{\tilde{P}}(x) > \mu_{\tilde{R}}(x)} \left( \mu_{\tilde{P}}(x) - \mu_{\tilde{R}}(x) \right). \quad (1)$$
Before we prove that the measure $C$ fulfils requirements C1–C4, let us explain what the measure actually does. Figure 6 shows a fuzzy set $\tilde{R}$ for requirements on the left hand side (we used a continuous domain for better illustration); the right triangular function is the fuzzy set $\tilde{P}$ for properties. The right term in (1) measures the size of the grey area, which can be interpreted as a degree to which the properties violate the requirements. The size of the area is bounded by
the area underneath the membership function of $\tilde{P}$, which we stipulated to be 1. That means that the right term measures the proportion of properties $\tilde{P}$ that violate the requirements. Determining the grey area is related to determining the size of the fuzzy difference $\tilde{P} - \tilde{R}$, but the operator $t\left(1 - \mu_{\tilde{R}}(x), \mu_{\tilde{P}}(x)\right)$ for the fuzzy difference does not match our semantics for properties and requirements.
From (1), the following immediately follows:

Lemma 1: For the fuzzy set $\tilde{R}'$ with $\mu_{\tilde{R}'}(x) = \min\left\{ \mu_{\tilde{R}}(x), \mu_{\tilde{P}}(x) \right\}$, it holds that $C(\tilde{P}, \tilde{R}') = C(\tilde{P}, \tilde{R})$.
The Lemma shows that the requirements $\tilde{R}$ can be stripped down to the intersection $\tilde{R}'$ of requirements and properties $\tilde{P}$ (using min as t-norm) without changing the compatibility of properties with requirements; see Figure 6.
Lemma 2: The measure $C$ defined above conforms to properties C1–C4.

Proof:

1. Since the right term in (1) is obviously not negative, and since

$$\sum_{x \in X:\, \mu_{\tilde{P}}(x) > \mu_{\tilde{R}}(x)} \left( \mu_{\tilde{P}}(x) - \mu_{\tilde{R}}(x) \right) \overset{(*)}{\leq} \sum_{x \in X} \mu_{\tilde{P}}(x) = 1, \quad (4)$$
C meets requirement C1.
2. In the case $\tilde{P} \cap \tilde{R} = \emptyset$, we have equality at (*) in (4), and therefore C2 is met.
3. If $\tilde{P} \subseteq \tilde{R}$, then the set $\{x \in X : \mu_{\tilde{P}}(x) > \mu_{\tilde{R}}(x)\}$ is empty, the right term in (1) is 0, and therefore $C(\tilde{P}, \tilde{R}) = 1$.
4. If $\tilde{R}$ grows or $\tilde{P}$ shrinks, the sum in (1) cannot increase (equivalently, the size of the grey area in Figure 6 cannot grow); therefore the value of $C$ does not decrease.
Furthermore, it turns out that $C$ is a measure of satisfiability as defined by Bouchon-Meunier, Rifqi, and Bothorel (1996), which is not surprising since their notion of satisfiability is very similar to our understanding of compatibility.

Since we deal with a set of requirements and properties, we end up having one match value for each requirement/property pair. Requiring a set of properties can formally be interpreted as a logical conjunction of the individual requirements. Given the assumption that all requirements are equally important, we therefore propose to use a t-norm to aggregate the individual match values. We decided to use multiplication, as it is a strictly monotonic operator. Strict monotonicity basically means that the overall match value decreases with any of the individual match values. Other operators like minimum do not have this property. In the case of minimum, the overall match obviously is the minimum of the individual matches.
Figure 6. Fuzzy sets of requirements $\tilde{R}$, properties $\tilde{P}$, and the intersection $\tilde{R}'$ (we used a continuous domain for better illustration). The grey area represents the incompatibility of properties with requirements.

That means
that all the match values apart from the minimum
can be increased without changing the overall
value. This is not the desired behaviour in our
case since many different sets of properties would
result in the same overall match value as long as
the minimal value is the same. So the proposed
measure for a multi-criteria match is
$$C(\tilde{P}, \tilde{R}) = \prod_{j=1}^{m} C(\tilde{P}_j, \tilde{R}_j) = \prod_{j=1}^{m} \left( 1 - \sum_{x \in X:\, \mu_{\tilde{P}_j}(x) > \mu_{\tilde{R}_j}(x)} \left( \mu_{\tilde{P}_j}(x) - \mu_{\tilde{R}_j}(x) \right) \right)$$
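To make the measure concrete, here is a minimal Python sketch (our own illustration, not part of SPIDA; the function names and the toy membership values are invented) that computes $C(\tilde{P}, \tilde{R})$ on a discrete domain and aggregates several requirement/property pairs with the product t-norm:

```python
def compatibility(mu_p, mu_r):
    """C(P, R) = 1 - sum of (mu_p(x) - mu_r(x)) over all x with mu_p(x) > mu_r(x).

    mu_p and mu_r map domain elements to membership degrees; following the
    text, mu_p is assumed to be normalised so that its values sum to 1.
    """
    violation = sum(max(p - mu_r.get(x, 0.0), 0.0) for x, p in mu_p.items())
    return 1.0 - violation

def multi_criteria_match(pairs):
    """Aggregate the individual match values with the product t-norm."""
    result = 1.0
    for mu_p, mu_r in pairs:
        result *= compatibility(mu_p, mu_r)
    return result

# Properties lying entirely under the requirements give a perfect match;
# the part of the properties sticking out (the grey area) is subtracted.
mu_r = {"low": 1.0, "medium": 0.5, "high": 0.0}
mu_p1 = {"low": 0.6, "medium": 0.4}    # within mu_r, so C = 1.0
mu_p2 = {"medium": 0.3, "high": 0.7}   # 0.7 violates "high", so C is about 0.3
print(multi_criteria_match([(mu_p1, mu_r), (mu_p2, mu_r)]))  # about 0.3
```

Unlike the minimum, the product decreases whenever any individual match value decreases, which is exactly the strict monotonicity argued for above.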
SPIDA AS AN ANALYTICAL SERVICE
A model that has been created by SPIDA only really becomes useful if it can be integrated into an operational process. For example, the customer relationship management system of a business may require a model to classify customers for marketing purposes like cross-selling, a network management system may require a model to monitor traffic patterns and identify potential intrusions, or a financial services system may require a model for predicting tomorrow's stock exchange rates to support buy/sell decisions. In order to support these types of operational systems, a model has to be taken out of the analytical software environment, and it has to be integrated in order to continuously score new data.
Integration can be done loosely, by running models out of the analytical environment and applying them to database tables that are in turn used by operational systems. In the future, modern database systems will offer analytical services and store models in PMML, thus integrating the analytical environment and data management. However, sometimes this form of integration is not sufficient, because operational systems can be based on proprietary database systems without common interfaces, or data must be scored on the fly in real time without first going into a database table.
Integration is made more difficult by the fact that from time to time, models have to be adjusted in order to keep up with changes in the business or the market. Such adaptations are usually done manually by analysts, who produce new models based on new data. Most of the time, the underlying analysis algorithm stays the same; only parameters change. For example, the weights of a neural network or the rules in a fuzzy system might be adapted.
With data analysis functionality in the database and a PMML model, even algorithms can be changed without compromising the integrity of operational systems. For example, it might turn out that a decision tree performs better on a new data set than a support vector machine that has been used before. By simply exchanging the PMML description of the model in the database, the new model can be integrated seamlessly. However, such database systems are not standard yet, and furthermore, the decision for the new algorithm and its integration would still require manual interaction.
The SPIDA wizard can automate this process by selecting, configuring, and implementing new algorithms without user interaction. The final decision whether a new algorithm will be uploaded into the operational system will probably remain under human supervision, but the difficult part of creating it can mostly be achieved by SPIDA.
A schema summarising all scenarios in combination with SPIDA is shown in Figure 7. Data is fed into the operational system from a data warehouse. The data analysis module conducts data analysis for the operational system. This way it influences the system's performance or the performance of related business processes, because analysis results are usually used to make business decisions. For example, the data analysis module might be used to predict if a customer is going to churn (i.e., cancel his contract), and performance measures might be the relative number of churners not being flagged up by the system and the resulting loss of revenue. In this example,
a drop in system performance will be the result of an inferior analytics model. Such a drop in performance will be detected by the monitor and trigger the generation of a new analysis model. The easiest way to detect dropping performance is by comparing against predefined thresholds. The reasons for deterioration can be manifold. In most cases, the dynamics of the market are responsible, that is, the market has changed in a way that cannot be detected by the analytical model. Other reasons include changes in related business processes and data corruption.
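The threshold idea can be stated in a few lines; the following Python sketch is our own illustration (the names and the threshold value are invented, and SPIDA's actual monitor is certainly richer than this):

```python
def monitor(performance_history, threshold=0.8):
    """Trigger model regeneration when the latest performance measure
    drops below a predefined threshold."""
    if performance_history and performance_history[-1] < threshold:
        return "rebuild"   # the wizard then uses requirements + latest data
    return "keep"

print(monitor([0.91, 0.88, 0.74]))  # -> "rebuild"
```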
Whenever model generation is triggered by the monitor, the wizard takes the user requirements and the latest data to build a new model. Another reason for building a new model is a change of user requirements. For example, due to a change of company policy or for regulatory reasons, a model may now be required to be comprehensible. If, for instance, a neural network had been used to assess the credit line of a potential borrower, we would be forced to switch to methods like decision trees or fuzzy systems, which create a rule base that can be understood by managers and the regulator. This may be required to demonstrate that the model conforms to an equal opportunities act, for example.
When it comes to integrating modules with operational systems, software engineers can nowadays choose from a variety of techniques that depend on both the operational system and the module. Where some years ago modules were quite often specially tailored to the operational system, nowadays much more modular approaches are used that allow for using the same modules in different operational systems on the one hand, and replacing modules in operational systems on the other hand. Apart from the database approach, where data analysis can directly be conducted by the database and analysis models are defined using a standard like PMML, other techniques use libraries with standard interfaces (APIs) and provide a respective interface in the operational system, or use even less coupled techniques like Web services. For SPIDA, we decided to build executable Java libraries which can easily be used in either way.
Since we require a variety of different data analysis techniques to cover as many applications as possible, the starting point for the generation of analysis modules are data analysis platforms like SPIDA. The actual analysis model required by an operational system typically uses only a tiny fraction of such a platform's functionality. To save space and, potentially, licensing costs, we want to restrict the module to the functions actually needed by the underlying data analysis model or process.
In the case of SPIDA, we are looking to create analytical modules, which are pieces of software independent from the SPIDA platform, corresponding to a particular data analysis model.
Figure 7. Monitoring of system performance and requirements and automatic creation and integration
of data analysis modules according to given requirements
Once extracted, the independent software module
can be used as a library with an exposed API by
the operational system. Changes in the analytical
model are completely hidden from the operational
system by the module, that is, the module can
easily be replaced if required due to changes in
requirements or a drop in performance.
At the heart of the integration of an analytic
module in SPIDA is a general framework with
which an operational system can communicate
and specify its requirements and obtain an analytic
module from SPIDA as a Java library.
First, the definition of the analysis problem and the user requirements are specified in XML. The SPIDA wizard parses the XML description and identifies the corresponding analysis components like data access, filters, and the actual analysis blocks. A mapping is made between the required components and the corresponding Java classes of the platform. SPIDA's software extractor component then finds all the dependent classes for this initial set of classes. The extractor uses application knowledge and a host of other extraction techniques to find all the additional classes required to make the module executable and independent from the platform. This complete set of all component classes is turned into a Jar library with suitable properties by SPIDA's Jar creator. This Jar library is passed to the operational system, which can access the API provided in the library to control the execution of the analysis module. It is also possible to instruct SPIDA to deliver the Jar library in the form of a stand-alone package that is able to directly execute the included model for specific data analysis tasks.
If we think of data analysis services, we can expect many different concurrent analysis requests. Since each analysis process will create an executable object, the size of dedicated Jar libraries should be kept as small as possible in order to allow for as many users as possible. Table 1 shows the compression rates for libraries created for different data mining algorithms compared to the whole SPIDA package, which range between 17 percent and 47 percent.
CASE STUDIES

The SPIDA platform discussed in this chapter is a research demonstrator, and its main purpose was to build and test methods for automatically generating predictive models in a data analysis process. Although the platform is constantly used in our team for day-to-day data analysis, applying it to operational processes requires further work. In this section we briefly mention three successful internal applications that build on SPIDA technology, that is, methods for automatically building models from data. Instead of making the whole platform available in an operational environment, we decided to build smaller self-contained applications targeting three business areas: travel management for mobile workforces, customer satisfaction analysis, and customer life cycle modelling.
BT has a large mobile workforce for maintaining its network, providing customers with products and services, and repairing faults. A dynamic scheduler is used to plan the working day of each engineer and assign the right job to the right person. In order to provide good schedules, it is important to have good estimates for job durations and inter-job travel times. We have built an intelligent travel time estimation system (ITEMS) that estimates inter-job times based on historic
Table 1. Compression rates of dedicated Java libraries for various data analysis algorithms compared to the whole SPIDA package

Model                     Relative Size
Linear Regression         33%
NEFPROX                   48%
Decision Trees            17%
Support Vector Machines   17%
Neural Networks           21%
NEFCLASS                  47%
evidence (Azvine, Ho, Kay, Nauck, & Spott, 2003). An important part of ITEMS is a graphical interface where resource managers can study patterns that lead to engineers arriving late for jobs. The purpose of this system is to understand where process issues lead to delays and how the process can be improved. SPIDA technology is used to learn patterns from data.
In order to be useful, the rules must be simple and sparse, and they must be created automatically in real-time for a specific set of journeys for which the user requires explanations (Nauck, 2003). We had two algorithms available from SPIDA that were suitable for this purpose: decision trees and a neuro-fuzzy classifier (NEFCLASS). These two algorithms and the SPIDA configurations required for creating small interpretable rule bases were implemented into ITEMS. Figure 8 displays a screenshot of a typical situation while using ITEMS. A user analyzes the travel pattern of some technician and has clicked an arrow in the displayed map. Additional windows display detailed information about the corresponding job. The top right window displays two fuzzy rules matching the travel information of the selected job. The user can see the degree of fulfilment of a rule and decide if a particular rule is useful for explaining the selected travel pattern, that is, why the technician was late, early, or on time.
Another application based on SPIDA technology has been built in the area of customer satisfaction analysis. Most service-oriented companies regularly survey their customers to learn about their satisfaction with the company's customer services. Variables in a survey typically display high degrees of interdependence, which is important to take into account when modelling survey results. In our application we also had to provide abilities for conducting what-if analyses and target setting, that is, planning how to achieve a certain level of customer satisfaction. We identified Bayesian networks as the ideal model for this scenario and used SPIDA technology to implement an application (iCSat) that fully automatically learns a Bayesian network from data (Nauck, Ruta, Spott, & Azvine, 2006). It allows the user to identify drivers of customer satisfaction, to run what-if scenarios for understanding how satisfaction levels might change when certain drivers change, and to simulate desired levels of customer satisfaction for identifying targets for drivers of satisfaction.
The final application based on SPIDA technology is used to model customer life cycles. We have built an intelligent customer analytics platform that can automatically learn hidden Markov models from customer event data (Azvine, Nauck, Ho, Broszat, & Lim, 2006). The models can be used to predict probabilities and timings of relevant future customer events like churn, complaints, purchases, and so on. The predictions are used to control pro-active customer services, like contacting a customer in time before a service problem escalates and could lead to complaints or churn.
Figure 8. The ITEMS system with automatically derived rules for explaining travel patterns

CONCLUSION

As a new direction in automating data analysis, we introduced the concept of using soft constraints for the selection of an appropriate data analysis method. These constraints represent the user's requirements regarding the analysis problem (like prediction, clustering, or finding dependencies) and preferences regarding the solution.
Requirements can potentially be defined at any level of abstraction. Expert knowledge in terms of a fuzzy rule base maps high-level requirements onto required properties of data analysis methods, which will then be matched to actual properties of analysis methods.
As a result of our work, we introduced a new measure for the compatibility of fuzzy requirements with fuzzy properties that can be applied to other problems in the area of multi-criteria decision making. The methods presented above have been implemented as a wizard for our data analysis tool SPIDA, which has been successfully used to produce solutions to a variety of problems within BT.
SPIDA can serve as a smart data analysis service for operational systems. Where current systems require manual intervention by data analysis experts if a data analysis module needs updating, our architecture automatically detects changes in performance or requirements, builds a new data analysis model, and wraps the model into an executable Java library with a standard API that can easily be accessed by operational systems. When building data analysis models, the wizard takes into account high-level user requirements regarding an explanation facility, simplicity of an explanation, adaptability, accuracy, and so on. In this way, we can vary the data analysis solution as much as possible without violating requirements. Before, adaptive solutions only retrained models based on new data; for example, an analysis method like a neural network was selected once and only adapted to new data. Switching automatically, without user interaction, to another method, for example to a support vector machine, was not possible, let alone considering user requirements for the automatic selection of the most suitable analysis method.
REFERENCES
Azvine, B., Ho, C., Kay, S., Nauck, D., & Spott, M. (2003). Estimating travel times of field engineers. BT Technology Journal, 21(4), 33–38.

Azvine, B., Nauck, D.D., Ho, C., Broszat, K., & Lim, J. (2006). Intelligent process analytics for CRM. BT Technology Journal, 24(1), 60–69.

Bouchon-Meunier, B., Rifqi, M., & Bothorel, S. (1996). Towards general measures of comparison of objects. Fuzzy Sets & Systems, 84(2), 143–153.

Bernstein, A., Hill, S., & Provost, F. (2002). Intelligent assistance for the data mining process: An ontology-based approach (CeDER Working Paper IS-02-02). New York: Center for Digital Economy Research, Leonard Stern School of Business, New York University.

Bernstein, A. & Provost, F. (2001). An intelligent assistant for the knowledge discovery process (CeDER Working Paper IS-01-01). New York: Center for Digital Economy Research, Leonard Stern School of Business, New York University.

Botia, J.A., Velasco, J.R., Garijo, M., & Skarmeta, A.F.G. (1998). A generic datamining system: Basic design and implementation guidelines. In H. Kargupta & P.K. Chan (Eds.), Workshop in distributed datamining at KDD-98. New York: AAAI Press.

Botia, J.A., Skarmeta, A.F., Velasco, J.R., & Garijo, M. (2000). A proposal for Meta-Learning through a MAS. In T. Wagner & O. Rana (Eds.), Infrastructure for agents, multi-agent systems, and scalable multi-agent systems (pp. 226–233). Lecture Notes in Artificial Intelligence 1887. Berlin: Springer-Verlag.

Cornelis, C., Van der Donck, C., & Kerre, E. (2003). Sinha-Dougherty approach to the fuzzification of set inclusion revisited. Fuzzy Sets & Systems, 134, 283–295.

Gebhardt, J. & Kruse, R. (1993). The context model: An integrating view of vagueness and uncertainty. International Journal of Approximate Reasoning, 9, 283–314.

Nauck, D.D. (2003a). Fuzzy data analysis with NEFCLASS. International Journal of Approximate Reasoning, 32, 103–130.

Nauck, D.D. (2003b). Measuring interpretability in rule-based classification systems. Proceedings of the 2003 IEEE International Conference on Fuzzy Systems (FuzzIEEE2003), 1, pp. 196–201. Saint Louis, MO: IEEE Press.

Nauck, D.D., Ruta, D., Spott, M., & Azvine, B. (2006). A tool for intelligent customer analytics. Proceedings of the IEEE International Conference on Intelligent Systems, pp. 518–521. London: IEEE Press.

Nauck, D.D., Spott, M., & Azvine, B. (2003). SPIDA: A novel data analysis tool. BT Technology Journal, 21(4), 104–112.

Sinha, D. & Dougherty, E. (1993). Fuzzification of set inclusion: Theory and applications. Fuzzy Sets & Systems, 55, 15–42.

Spott, M. (2001). Combining fuzzy words. Proceedings of IEEE International Conference on Fuzzy Systems 2001. Melbourne, Australia: IEEE Press.

Spott, M. (2005). Efficient reasoning with fuzzy words. In S.K. Halgamuge & L.P. Wang (Eds.), Computational intelligence for modelling and predictions, pp. 117–128. Studies in Computational Intelligence 2. Berlin: Springer-Verlag.

Spott, M. & Nauck, D.D. (2005). On choosing an appropriate data analysis algorithm. Proceedings of the IEEE International Conference on Fuzzy Systems 2005. Reno, NV: IEEE Press.

University of California at Irvine (2007). UCI Machine Learning Repository. Accessed March 26, 2007 at http://www.ics.uci.edu/~mlearn/MLRepository.html

Wirth, R., Shearer, C., Grimmer, U., Reinartz, J., Schloesser, T.P., Breitner, C., Engels, R., & Lindner, G. (1997). Towards process-oriented tool support for knowledge discovery in databases. Principles of Data Mining and Knowledge Discovery: First European Symposium, PKDD '97, pp. 243–253. Lecture Notes in Computer Science 1263. Berlin: Springer-Verlag.

Witten, I.H. & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann Publishers.

Zadeh, L.A. (1965). Fuzzy sets. Information and Control, 8, 338–353.
Chapter II
Random Fuzzy Sets
Hung T. Nguyen
New Mexico State University, USA
Vladik Kreinovich
University of Texas at El Paso, USA
Gang Xiang
University of Texas at El Paso, USA
ABSTRACT

It is well known that in decision making under uncertainty, while we are guided by a general (and abstract) theory of probability and of statistical inference, each specific type of observed data requires its own analysis. Thus, while textbook techniques treat precisely observed data in multivariate analysis, there are many open research problems when data are censored (e.g., in medical or bio-statistics), missing, or partially observed (e.g., in bioinformatics). Data can be imprecise due to various reasons, for example, due to fuzziness of linguistic data. Imprecise observed data are usually called coarse data. In this chapter, we consider coarse data which are both random and fuzzy. Fuzziness is a form of imprecision often encountered in perception-based information. In order to develop statistical inference procedures based on such data, we need to model random fuzzy data as bona fide random elements, that is, we need to place random fuzzy data completely within the rigorous theory of probability. This chapter presents the most general framework for random fuzzy data, namely the framework of random fuzzy sets. We also describe several applications of this framework.
FROM MULTIVARIATE STATISTICAL ANALYSIS TO RANDOM SETS
What is a random set? An intuitive meaning. What is a random set? Crudely speaking, a random number means that we have different numbers with different probabilities; a random vector means that we have different vectors with different probabilities; and similarly, a random set means that we have different sets with different probabilities.

How can we describe this intuitive idea in precise terms? To provide such a formalization, let us recall how probabilities and random vectors are usually defined.
How probabilities are usually defined. To describe probabilities, in general, we must have a set $\Omega$ of possible situations $\omega$, and we must be able to describe the probability $P$ of different properties of such situations. In mathematical terms, a property can be characterized by the set of all the situations $\omega$ which satisfy this property. Thus, we must assign to sets $A \subseteq \Omega$ the probability value $P(A)$.

According to the intuitive meaning of probability (e.g., as frequency), if we have two disjoint sets $A$ and $A'$, then we must have $P(A \cup A') = P(A) + P(A')$. Similarly, if we have countably many mutually disjoint sets $A_i$, we must have:

$$P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i).$$
A mapping which satisfies this property is called $\sigma$-additive.

It is known that even in the simplest situations, for example, when we randomly select a number from the interval $\Omega = [0, 1]$, it is not possible to have a $\sigma$-additive function $P$ which would be defined on all subsets of $[0, 1]$. Thus, we must restrict ourselves to a class $\mathcal{A}$ of subsets of $\Omega$. Since subsets represent properties, a restriction on subsets means a restriction on properties. If we allow two properties $F$ and $F'$, then we should also be able to consider their logical combinations $F \,\&\, F'$, $F \vee F'$, and $\neg F$, which in set terms correspond to intersection, union, and complement. Similarly, if we have a sequence of properties $F_n$, then we should also allow properties $\vee_n F_n$ and $\wedge_n F_n$, which correspond to countable union and intersection. Thus, the desired family $\mathcal{A}$ should be closed under (countable) union, (countable) intersection, and complement. Such a family is called a $\sigma$-algebra.

Thus, we arrive at a standard definition of a probability space as a triple $(\Omega, \mathcal{A}, P)$, where $\Omega$ is a set, $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\Omega$, and $P : \mathcal{A} \to [0, 1]$ is a $\sigma$-additive function.
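For a finite set $\Omega$, this definition is easy to make concrete; the following Python sketch (our own toy example) builds a small probability space, with the power set as the $\sigma$-algebra, and checks additivity on disjoint events:

```python
from fractions import Fraction

# Toy probability space (Omega, A, P): Omega models a fair die, A is the
# full power set (trivially a sigma-algebra on a finite set), and P is the
# uniform measure, which is additive by construction.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event, i.e., of a subset of Omega."""
    return Fraction(len(event & omega), len(omega))

A, B = {1, 2}, {5}
assert A & B == set()             # disjoint events...
assert P(A | B) == P(A) + P(B)    # ...so their probabilities add up
print(P(A | B))                   # 1/2
```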
How random objects are usually defined. Once a probability space is fixed, a random object from the set $C$ is defined as a mapping $V : \Omega \to C$. For example, the set of all $d$-dimensional vectors is the Euclidean space $\mathbb{R}^d$, so a random vector is defined as a map $V : \Omega \to \mathbb{R}^d$.

The map $V$ must enable us to define probabilities of different properties of objects. For that, we need to fix a class of properties; we already know that such a class $\mathcal{B}$ should be a $\sigma$-algebra. For each property of objects, that is, for each set $B \in \mathcal{B}$, it is natural to define the probability $P(B)$ as the probability that for a random situation $\omega$, the corresponding object $V(\omega)$ belongs to the set $B$ (i.e., satisfies the desired property). In precise terms, this means that we define this probability as $P(V^{-1}(B))$, where $V^{-1}(B)$ denotes $\{\omega : V(\omega) \in B\}$.

For this definition to be applicable, we must require that for every set $B$ from the desired $\sigma$-algebra $\mathcal{B}$, the set $V^{-1}(B)$ must belong to $\mathcal{A}$. Such mappings are called $\mathcal{A}$-$\mathcal{B}$-measurable.
How random vectors are usually defined. In particular, for random vectors, it is reasonable to allow properties corresponding to all open sets $B \subseteq \mathbb{R}^d$ (like $x_1 > 0$) and properties corresponding to all closed sets $B$ (such as $x_1 \le 0$). So, we must consider the smallest $\sigma$-algebra which contains all closed and open sets. Sets from this smallest $\sigma$-algebra are called Borel sets; the class of all Borel sets over $\mathbb{R}^d$ is denoted by $\mathcal{B}(\mathbb{R}^d)$. So, a random vector is usually defined as an $\mathcal{A}$-$\mathcal{B}^d$-measurable mapping $V : \Omega \to \mathbb{R}^d$.
Alternatively, we can consider the vectors themselves as events, that is, consider a probability space $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d), P_V)$. In this reformulation, as we have mentioned, for every set $B$, we get $P_V(B) = P(V^{-1}(B))$, so we can say that $P_V$ is a composition of the two mappings: $P_V = P \circ V^{-1}$. This measure $P_V$ is called a probability law of the random vector.
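In computations, the probability law is often approximated by sampling; the following Python sketch (our own example, not from the chapter) estimates $P_V(B) = P(V^{-1}(B))$ by Monte Carlo for a two-dimensional Gaussian random vector and one particular Borel set:

```python
import random

# V is a 2-dimensional standard Gaussian random vector; B is the Borel set
# {(x1, x2) : x1 > 0}, whose true probability under P_V is 1/2.
def sample_V():
    return (random.gauss(0.0, 1.0), random.gauss(0.0, 1.0))

def in_B(v):
    return v[0] > 0.0

n = 100_000
estimate = sum(in_B(sample_V()) for _ in range(n)) / n
print(estimate)  # close to 0.5
```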
How random vectors are usually described. From the purely mathematical viewpoint, this is a perfect definition of a random vector. However, from the computational viewpoint, the need to describe $P_V(B)$ for all Borel sets $B$ makes this description impractical. The good news is that due to $\sigma$-additivity, we do not need to consider all possible Borel sets $B$; it is sufficient to describe $P_V(B)$ only for the sets of the type $(-\infty, x_1] \times \dots \times (-\infty, x_d]$, that is, it is sufficient to consider only the probabilities that $V_1 \le x_1 \,\&\, \dots \,\&\, V_d \le x_d$.

In the one-dimensional case, the probability $F(x)$ that $V \le x$ is called a (cumulative) distribution function. Similarly, in the general $d$-dimensional case, the probability $F(x_1, \dots, x_d)$ that $V_1 \le x_1$, ..., and $V_d \le x_d$ is called a distribution function. In these terms, the probability measures on $\mathcal{B}(\mathbb{R}^d)$ are uniquely characterized by their distribution functions. This result was first proved by Lebesgue and Stieltjes and is thus called the Lebesgue-Stieltjes theorem.
From random vectors to slightly more general random objects. The above definition of a random vector does not use any specific features of the multi-dimensional vector space $\mathbb{R}^d$, so it can be naturally generalized to the case when the set $C$ of objects (possible outcomes) is a locally compact Hausdorff second countable topological space. Such spaces will be described as LCHS, for short.
From random vectors to random sets. As we have mentioned, in some real-life situations, the outcome is not a vector but a set of possible vectors. In this case, possible outcomes are subsets of the vector space $\mathbb{R}^d$, so the set $C$ of possible outcomes can be described as the set of all such subsets: the power set $\mathcal{P}(\mathbb{R}^d)$.
Need to consider one-element sets. An important particular case of this general situation is the case when we know the exact vector $x$. In this case, the set of all possible vectors is the corresponding one-element set $\{x\}$. So, one-element sets must appear as possible sets.
Restriction to closed sets. From the practical viewpoint, it is sufficient to only consider closed sets.

Indeed, by definition, a closed set is a set which contains all the limits of its elements. If $x_n \in S$ and $x_n \to x$, then, by definition of a limit, this means that whatever accuracy we choose, we cannot distinguish between $x$ and the values $x_n$ for sufficiently large $n$. Since $x_n \in S$ and $x$ is indistinguishable from $x_n$, it makes sense to conclude that $x$ also belongs to $S$, that is, that $S$ is indeed a closed set.

Since one-element sets are closed, this restriction is in accordance with the above-mentioned need to consider one-element sets. (In the more general framework of a LCHS space, the requirement that all one-point sets be closed is one of the reasons why we impose the restriction that the topology must be Hausdorff: for Hausdorff topological spaces, this requirement is satisfied.)
With this restriction, the set $C$ of possible outcomes is the class of all closed subsets of the space $\mathbb{R}^d$; this class is usually denoted by $\mathcal{F}(\mathbb{R}^d)$, or simply $\mathcal{F}$, for short.
It is worth mentioning that the decision to restrict ourselves to closed sets was made already in the pioneering book (Matheron, 1975) on random sets.
We need a topology on $\mathcal{F}$. To finalize our definition of closed random sets, we must specify a $\sigma$-field on $\mathcal{F}$. To specify such a field, we will follow the same idea as with random vectors, namely, we will define a topology on $\mathcal{F}$ and then consider the $\sigma$-field of all the Borel sets in this topology (i.e., the smallest $\sigma$-field that contains all sets which are open or closed in the sense of this topology).

In his monograph, Matheron (1975) described a natural topology that he called a hit-or-miss topology. (It is worth mentioning that this topology was first introduced in a completely different problem: the construction of the regularized dual space of a $C^*$-algebra [Fell, 1961, 1962].)

Matheron's motivations and definitions work well for random sets. However, they are difficult to directly generalize to random fuzzy sets. Since the main objective of this paper is to describe and analyze random fuzzy sets, we will present a different derivation of this topology, a derivation that can be naturally generalized to random fuzzy sets.
This alternative definition is based on the fact that on the class of closed sets, there is a natural order $A \supseteq B$. The meaning of this order is straightforward. If we have a (closed) set $A$ of possible vectors, this means that we only have partial information about the vector. When we gain additional information, this enables us to reduce the original set $A$ to its subset $B$. Thus, $A \supseteq B$ means that the set $B$ carries more information about the (unknown) vector than the set $A$.

It turns out that in many important situations, this order enables us to describe a natural topology on the corresponding ordered set. Specifically, this topology exists when the corresponding order forms a so-called continuous lattice. To describe this topology in precise terms, we thus will need to recall the basic notions of lattice theory and the definition of a continuous lattice.
Basics of lattice theory. Lattices are a particular case of partially ordered sets (also known as posets); so, to define a lattice, we must first recall the definition of a poset. A poset is defined as a set $X$ together with a relation $\le$ which satisfies the following properties:

(i) $\le$ is reflexive, that is, $x \le x$ for all $x \in X$;
(ii) $\le$ is anti-symmetric, that is, for all $x, y \in X$, $x \le y$ and $y \le x$ imply that $x = y$;
(iii) $\le$ is transitive, that is, for all $x, y, z \in X$, if $x \le y$ and $y \le z$, then $x \le z$.
An element $u \in X$ is called an upper bound for a pair $x, y$ if $x \le u$ and $y \le u$. Dually, an element $v \in X$ is called a lower bound for a pair $x, y$ if $v \le x$ and $v \le y$. These concepts can be naturally extended to arbitrary collections of elements $A \subseteq X$: an element $u \in X$ is an upper bound for $A$ if $x \le u$ for all $x \in A$; an element $v \in X$ is a lower bound for $A$ if $v \le x$ for all $x \in A$.

The least upper bound is exactly what it sounds like: the least of all the upper bounds. Similarly, the greatest lower bound is the greatest of all the lower bounds. In precise terms, an element $u \in X$ is called the least upper bound of the set $A$ if it satisfies the following two conditions:

1. $u$ is an upper bound for the set $A$ (i.e., $x \le u$ for all $x \in A$); and
2. if $w$ is an upper bound for the set $A$, then $u \le w$.

The least upper bound of a set $A$ is also called its join or supremum and is usually denoted by $\vee A$ or $\sup A$.

Similarly, an element $v \in X$ is called the greatest lower bound of the set $A$ if it satisfies the following two conditions:

1. $v$ is a lower bound for the set $A$ (i.e., $v \le x$ for all $x \in A$); and
2. if $w$ is a lower bound for the set $A$, then $w \le v$.
The greatest lower bound of a set $A$ is also called its meet or infimum and is usually denoted by $\wedge A$ or $\inf A$.

For two-element sets, $\vee\{a, b\}$ is usually denoted by $a \vee b$ and $\wedge\{a, b\}$ by $a \wedge b$.

A poset $X$ is called a lattice if $a \vee b$ and $a \wedge b$ exist for all pairs $a, b \in X$. A poset $X$ is called a complete lattice if both $\vee A$ and $\wedge A$ exist for any set $A \subseteq X$.
Examples of lattices. The set $\mathbb{R}$ of all real numbers with the standard order is a lattice, with $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. This set is not a complete lattice, since it does not have a join $\vee \mathbb{R}$: no real number is larger than all the others.

A slight modification of this example makes it a complete lattice: namely, the set of all real numbers from an interval $[\underline{x}, \overline{x}]$ is a complete lattice.
The set of all $n$-dimensional vectors $a = (a_1, \dots, a_n)$ with the component-wise order $a \le b \Leftrightarrow a_1 \le b_1 \,\&\, \dots \,\&\, a_n \le b_n$ is also a lattice, with $(a \vee b)_i = \max(a_i, b_i)$ and $(a \wedge b)_i = \min(a_i, b_i)$.

Another example of a lattice is the set of all the functions, with $(f \vee g)(x) = \max(f(x), g(x))$ and $(f \wedge g)(x) = \min(f(x), g(x))$.
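These joins and meets are immediate to compute; a minimal Python sketch (our own illustration) of the component-wise and point-wise lattice operations:

```python
# Component-wise join/meet for vectors and point-wise join/meet for
# functions -- the two lattice examples from the text.
def vec_join(a, b):
    return tuple(max(x, y) for x, y in zip(a, b))

def vec_meet(a, b):
    return tuple(min(x, y) for x, y in zip(a, b))

def fun_join(f, g):
    return lambda x: max(f(x), g(x))

def fun_meet(f, g):
    return lambda x: min(f(x), g(x))

print(vec_join((1, 5), (3, 2)))   # (3, 5)
print(vec_meet((1, 5), (3, 2)))   # (1, 2)
h = fun_join(abs, lambda x: 1 - x)
print(h(0.25))                    # max(0.25, 0.75) = 0.75
```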
Continuous lattices and Lawson topology. In some posets, in addition to the original relation $\le$ ('smaller'), we can define a new relation $\ll$ whose meaning is 'much smaller.' This relation is called 'way below.'

The formal definition of $\ll$ requires two auxiliary notions: of a directed set and of a dcpo. A set $A \subseteq X$ is called directed if every finite subset of $A$ has an upper bound in $A$. Of course, in a complete lattice, every set is directed.

A poset $X$ is called a directed complete partial order (dcpo) if each of its directed subsets $D$ has a supremum $\vee D$. In particular, every complete lattice is a dcpo.

We say that $x$ is way below $y$ ($x \ll y$) if for every directed subset $D$ for which $y \le \vee D$, there exists an element $d \in D$ such that $x \le d$.

In particular, a one-element set $D = \{y\}$ is always directed, and for this set, we conclude that $x \le y$, that is, 'way below' indeed implies $\le$. Another simple example: for the natural order on the set of real numbers, $x \ll y$ simply means $x < y$.
From the common sense viewpoint, we expect that if $x$ is way below $y$ and $z$ is even smaller than $x$, then $z$ should also be way below $y$. This is indeed true for the above definition: if $x \ll y$ and $z \le x$, then $z \ll y$; indeed, for every directed $D$ with $y \le \vee D$, we have $x \le d$ for some $d \in D$ and hence $z \le x \le d$, that is, $z \le d$ for that same element $d \in D$. Similarly, we can prove all three statements of the following lemma.
Lemma 1. Let $(L, \le)$ be a dcpo. Then, for any $u, x, y, z$ in $L$, we have:

(i) $x \ll y$ implies $x \le y$;
(ii) $u \le x \ll y \le z$ implies $u \ll z$;
(iii) if $x \ll z$ and $y \ll z$, then $x \vee y \ll z$.
For example, to prove (iii), for every directed set $D$ with $z \le \vee D$, we must prove that $x \vee y \le d$ for some $d \in D$. Indeed, from $x \ll z$ and $y \ll z$, we conclude that $x \le d_x$ and $y \le d_y$ for some $d_x, d_y \in D$. Since $D$ is a directed set, there exists an element $d \in D$ which is an upper bound for both $d_x$ and $d_y$, that is, for which $d_x \le d$ and $d_y \le d$. From $x \le d_x$ and $d_x \le d$, we conclude that $x \le d$ and, similarly, that $y \le d$, so $d$ is an upper bound for $x$ and $y$. By definition of the least upper bound $x \vee y$, it must be smaller than or equal to any other upper bound, hence $x \vee y \le d$. The statement is proven.
We can define a topology if we take, as a subbase, the sets $\{y \in X : x \ll y\}$ and $\{y \in X : x \not\le y\}$ for all $x \in X$. In other words, as open sets for this topology, we take arbitrary unions of finite intersections of these sets. This topology is called the Lawson topology.

It is worth mentioning that for the standard order on real numbers, the sets $\{y \in X : x \ll y\} = \{y : y > x\}$ and $\{y \in X : x \not\le y\} = \{y : y < x\}$ are indeed open, and the corresponding topology coincides with the standard topology on the real line.
There is an important, reasonably general case when the Lawson topology has useful properties: the case of a continuous lattice. A complete lattice $X$ is called a continuous lattice if every element $x \in X$ is equal to the join of all the elements which are way below it, that is, if $x = \vee\{y \in X : y \ll x\}$ for all $x \in X$. It is known that on every continuous lattice $(X, \le)$, the Lawson topology is compact and Hausdorff; see, for example, Gierz, Hofmann, Keimel, Lawson, Mislove, and Scott (1980).
Final definition of a (closed) random set and the need for further analysis. In our analysis of random sets, we will use the Lawson topology to describe the $\sigma$-algebra of subsets of $\mathcal{F}$ as the class of all Borel sets in the sense of this topology.

From the purely mathematical viewpoint, this is a (somewhat abstract but) perfect definition. However, since our objective is to make this definition applicable to practical problems, we need to first reformulate this general abstract definition in more understandable, closer-to-practice terms.
THE LAWSON TOPOLOGY OF CLOSED SETS
First try. Let us describe what these notions lead to in the case of closed sets. For the class $\mathcal{F}$ of closed sets, there is a natural ordering relation: set inclusion $F \subseteq F'$. The corresponding poset $(\mathcal{F}, \subseteq)$ is indeed a complete lattice: for every family $F_i$ ($i \in I$) of closed sets, there exist both the infimum and the supremum:

$$\wedge\{F_i : i \in I\} = \bigcap\{F_i : i \in I\};$$
$$\vee\{F_i : i \in I\} = \text{the closure of } \bigcup\{F_i : i \in I\}.$$

The resulting complete lattice is not continuous. Let us show that with $\subseteq$ as the ordering relation $\le$, $\mathcal{F}$ is not continuous. Specifically, we will show that, for example, in the one-dimensional case $\mathbb{R}^d = \mathbb{R}$, the defining property $F = \vee\{G \in \mathcal{F} : G \ll F\}$ of a continuous lattice is violated for $F = \mathbb{R}$.
For that, we will prove that for every two sets $F$ and $G$, $G \ll F$ implies that $G = \emptyset$. We will prove this by reduction to a contradiction. Let us assume that $G \neq \emptyset$ and $G \ll F$. Since the set $G$ is not empty, it contains an element $x \in G$. For this element $x$, the one-point set $S = \{x\}$ is a closed subset of $G$: $S \subseteq G$. By our first-try definition of $\le$, this means that $S \le G$. We have already mentioned that if $S \le G$ and $G \ll F$, then $S \ll F$. By definition of the 'way below' relation $\ll$, this means that if $F \le \vee D$, then there exists an element $d \in D$ for which $S \le d$. In particular, we can take as $D$ the family $\{d_n\}_n$, where for every positive integer $n$,

$$d_n \stackrel{\text{def}}{=} \left( -\infty, x - \tfrac{1}{n} \right] \cup \left[ x + \tfrac{1}{n}, +\infty \right).$$

It is easy to see that $d_{n_1} \vee \dots \vee d_{n_k} = d_{\max(n_1, \dots, n_k)}$, so the family $D$ is indeed directed. The union $\bigcup_n d_n$ of these sets is equal to $(-\infty, x) \cup (x, +\infty)$. Thus, the closure $\vee D$ of this union is the entire real line $\mathbb{R}$. Since $F$ is a subset of the real line, we have $F \le \vee D$. However, for every $d_n \in D$, we have $x \notin d_n$ and thus, $S \not\le d_n$. The contradiction shows that a non-empty set $G$ cannot be 'way below' any other set $F$. Therefore, the only set $G$ for which $G \ll \mathbb{R}$ is the empty set, so $\vee\{G \in \mathcal{F} : G \ll \mathbb{R}\} = \emptyset \neq \mathbb{R}$.
Correct definition. Fortunately, a small modification of the above definition makes $\mathcal{F}$ a continuous lattice. Namely, as the desired ordering relation $\le$ on the class $\mathcal{F}$ of all closed sets, we can consider the reverse inclusion relation $\supseteq$.

It is easy to show that $(\mathcal{F}, \supseteq)$ is a complete lattice: indeed,

$$\wedge\{F_i : i \in I\} = \text{the closure of } \bigcup\{F_i : i \in I\};$$
$$\vee\{F_i : i \in I\} = \bigcap\{F_i : i \in I\}.$$

Let us prove that $(\mathcal{F}, \supseteq)$ is a continuous lattice. By definition, a continuous lattice means that for every closed set $F \in \mathcal{F}$, we have $F = \vee\{G \in \mathcal{F} : G \ll F\}$. Since $G \ll F$ implies $G \le F$, that is, $G \supseteq F$, we always have $\vee\{G \in \mathcal{F} : G \ll F\} \le F$. So, to prove the desired equality, it is sufficient to prove that $F \le \vee\{G \in \mathcal{F} : G \ll F\}$.
We have already mentioned that in the lattice $(\mathcal{F}, \supseteq)$, the join $\vee$ is simply the intersection of the corresponding sets, so the desired property can be rewritten as $F \supseteq \bigcap\{G : G \ll F\}$. It turns out that it is easier to prove the equivalent inclusion of complements, that is, to prove that $F^c \subseteq \bigcup\{G^c : G \in \mathcal{F}, G \ll F\}$ (where $F^c$ denotes the complement of the set $F$).

There is no easy and intuitive way to immediately prove this result, because the notion of 'way below' is complex and is therefore not intuitively clear. So, to be able to prove results about this relation $\ll$, let us reformulate it in an equivalent, more intuitive way.
Lemma 2. For $X = \mathbb{R}^d$ (and, more generally, for an arbitrary locally compact Hausdorff second countable topological space $X$), for $F, G \in \mathcal{F}(X)$, $F \ll G$ if and only if there exists a compact set $K \subseteq X$ such that $F^c \subseteq K \subseteq G^c$.
Proof

1. Sufficiency. Let $F, G \in \mathcal{F}(X)$ and let $K$ be a compact set such that $F^c \subseteq K \subseteq G^c$. Let $G \le \vee D$ for some directed family $D$, that is, let $G \supseteq \vee D$. We already know that $\vee$ is simply an intersection, that is, $G \supseteq \bigcap D$. In terms of complements, we get an equivalent inclusion $G^c \subseteq \bigcup\{d^c : d \in D\}$. Since $K \subseteq G^c$, we conclude that $K \subseteq \bigcup\{d^c : d \in D\}$. The complements $d^c$ of closed sets $d \in D$ are open sets. Thus, the compact set $K$ is covered by a family of open sets $d^c$, $d \in D$. By definition of a compact set, this means that we can select a finite subcover, that is, conclude that $K \subseteq F_1^c \cup \dots \cup F_n^c$ for some closed sets $F_i \in D$. Since $F^c \subseteq K$, we thus have $F^c \subseteq F_1^c \cup \dots \cup F_n^c$, that is, equivalently, $F \supseteq F_1 \cap \dots \cap F_n$, that is, $F \le F_1 \vee \dots \vee F_n$.

Since the sets $F_1, \dots, F_n$ belong to the directed family $D$, this family must also contain an upper bound $d \in D$ for them. By definition of the least upper bound, we have $F_1 \vee \dots \vee F_n \le d$, hence $F \le d$. Thus, indeed, $F \ll G$.
2. Necessity. Let $F \ll G$. Since the underlying topological space $X$ is locally compact, each point $x \in G^c$ has a compact neighborhood $Q_x \subseteq G^c$ such that its interior $\mathring{Q}_x$ contains $x$. We thus have $G^c = \bigcup\{\mathring{Q}_x : x \in G^c\}$ or, equivalently,

$$G = \bigcap\{(\mathring{Q}_x)^c : x \in G^c\} = \vee\{(\mathring{Q}_x)^c : x \in G^c\}.$$

Finite joins (i.e., finite intersections) of the closed sets $(\mathring{Q}_x)^c$ form a directed family $D$, for which the join $\vee D$ is the same set $G$. Since $G \le \vee D$ and $F \ll G$, we conclude (by definition of the 'way below' relation) that $F \le \vee\{(\mathring{Q}_{x_i})^c : i = 1, \dots, n\}$ for some $x_i \in G^c$. Thus, $F \supseteq \bigcap_{i=1}^{n} (\mathring{Q}_{x_i})^c$ or, equivalently, $F^c \subseteq \bigcup_{i=1}^{n} \mathring{Q}_{x_i}$. Therefore, $F^c \subseteq \bigcup_{i=1}^{n} Q_{x_i} \subseteq G^c$. Since each set $Q_{x_i}$ is compact, their finite union is also compact, so we have the desired inclusion $F^c \subseteq K \subseteq G^c$ with $K = \bigcup_{i=1}^{n} Q_{x_i}$.
The lemma is proven.
Proof that $(\mathcal{F}(X), \supseteq)$ is a continuous lattice. Let us now use Lemma 2 to show that $(\mathcal{F}(X), \supseteq)$ is a continuous lattice. We have already mentioned that to prove this fact, we must prove that $F^c \subseteq \bigcup\{G^c : G \in \mathcal{F}, G \ll F\}$ for every closed set $F \in \mathcal{F}(X)$. Indeed, let $F$ be a closed set. Since $X$ is locally compact, for every point $x$ from the open set $F^c$, there exists a compact neighborhood $K_x \subseteq F^c$ such that its interior $\mathring{K}_x$ contains $x$. The complement $A = (\mathring{K}_x)^c$ of this (open) interior is a closed set $A \in \mathcal{F}(X)$, for which $x \in A^c = \mathring{K}_x \subseteq K_x \subseteq F^c$. So, due to Lemma 2, $A \ll F$. In other words, if $x \in F^c$, then there is an $A \in \mathcal{F}$ such that $x \in A^c$ and $A \ll F$. So, we conclude that $F^c \subseteq \bigcup\{G^c : G \in \mathcal{F}, G \ll F\}$, that is, that $(\mathcal{F}(X), \supseteq)$ is indeed a continuous lattice.
Lawson topology on the class of all closed sets. Since $(\mathcal{F}(X), \supseteq)$ is a continuous lattice, we can define the Lawson topology for this lattice. Let us reformulate this general abstract notion in more understandable terms. In the following text, we will denote the class of all compact subsets of the space $X$ by $\mathcal{K}(X)$ and the class of all open subsets of $X$ by $\mathcal{O}(X)$. When the space $X$ is clear from the context, we will simply write $\mathcal{K}$ and $\mathcal{O}$.
Theorem 1. For every LCHS $X$, the Lawson topology on the continuous lattice $(\mathcal{F}(X), \supseteq)$ is generated by the subbase consisting of the subsets of $\mathcal{F}$ of the form

$$\mathcal{F}^K \stackrel{\text{def}}{=} \{F \in \mathcal{F} : F \cap K = \emptyset\} \quad \text{and} \quad \mathcal{F}_G \stackrel{\text{def}}{=} \{F \in \mathcal{F} : F \cap G \neq \emptyset\},$$

where $K \in \mathcal{K}$ and $G \in \mathcal{O}$.
Comment. The class $\{F \in \mathcal{F} : F \cap K = \emptyset\}$ is the class of all random sets $F$ which 'miss' the set $K$, and the class $\{F \in \mathcal{F} : F \cap G \neq \emptyset\}$ is the class of all random sets $F$ which 'hit' the set $G$. Because of this interpretation, the above topology is called the hit-or-miss topology (Matheron, 1975). So, the meaning of Theorem 1 is that the Lawson topology on $(\mathcal{F}(X), \supseteq)$ coincides with the hit-or-miss topology.
Proof. By definition, the Lawson topology has a subbase consisting of sets of the form $\{F' \in \mathcal{F} : F \ll F'\}$ and $\{F' \in \mathcal{F} : F \not\le F'\}$ for all $F \in \mathcal{F}$, where $\le$ is $\supseteq$. It turns out that to prove the theorem, it is sufficient to reformulate these sets in terms of $\mathcal{K}$ and $\mathcal{O}$.

1. Clearly:

$$\{F' \in \mathcal{F} : F \not\le F'\} = \{F' \in \mathcal{F} : F' \cap F^c \neq \emptyset\} = \mathcal{F}_{F^c},$$

a 'hit' set for the open set $F^c$.

2. For $K \in \mathcal{K}$, we have, by Lemma 2:

$$\mathcal{F}^K = \{F' \in \mathcal{F} : F' \cap K = \emptyset\} = \bigcup\left\{ \{F' \in \mathcal{F} : F \ll F'\} : F \in \mathcal{F},\ K \subseteq F^c \right\}.$$

Indeed, if $F' \cap K = \emptyset$, then, by local compactness, there is an open set $U$ with compact closure such that $K \subseteq U \subseteq \overline{U} \subseteq (F')^c$; for the closed set $F = U^c$ we then have $F^c \subseteq \overline{U} \subseteq (F')^c$, so $F \ll F'$ and $K \subseteq F^c$. On the other hand, $F \ll F'$ implies $F^c \subseteq (F')^c$, so if $K \subseteq F^c$, we have $K \subseteq (F')^c$ and $F' \cap K = \emptyset$. The theorem is proven.
From topology to metric: need and possibility. We have defined a topology on the class of all closed sets. Topology describes the intuitive notion of closeness in qualitative terms. From the viewpoint of applications, it is convenient to use quantitative measures of closeness, such as a metric.

Before we start describing a corresponding metric, let us first prove that such a metric is indeed possible, that is, that the corresponding topological space is metrizable. It is known that for every continuous lattice, the corresponding Lawson topology is compact and Hausdorff. Let us prove that for LCHS spaces $X$, the Lawson topology on $(\mathcal{F}(X), \supseteq)$ is not only compact and Hausdorff, it is also second countable, and therefore metrizable.
Theorem 2. For every LCHS $X$, the Lawson topology on $(\mathcal{F}(X), \supseteq)$ is second countable.
Proof. This proof is given in Matheron (1975) for the hit-or-miss topology. We reproduce it here for completeness.

Since $X$ is a locally compact Hausdorff second countable space, there exists a countable basis $\mathcal{B}$ for the topology $\mathcal{O}$ consisting of relatively compact sets $B \in \mathcal{B}$ (i.e., sets whose closure $\overline{B}$ is compact). The fact that $\mathcal{B}$ is a basis means that every open set $G \in \mathcal{O}$ can be represented as a union of open sets from this basis, that is, that it can be represented in the form $G = \bigcup_{i \in I} B_i$, where $B_i \in \mathcal{B}$ and $\overline{B}_i \subseteq G$.

By definition of the hit-or-miss topology, its subbase consists of the sets $\mathcal{F}^A = \{F \in \mathcal{F} : F \cap A = \emptyset\}$ for compact sets $A$ and the sets $\mathcal{F}_A = \{F \in \mathcal{F} : F \cap A \neq \emptyset\}$ for open sets $A$. Thus, the base of this topology
consists of the finite intersections of such sets. The intersection of two sets $\mathcal{F}^A = \{F \in \mathcal{F} : F \cap A = \emptyset\}$ and $\mathcal{F}^{A'} = \{F \in \mathcal{F} : F \cap A' = \emptyset\}$ consists of all the sets $F$ which have common points with neither $A$ nor $A'$; this is equivalent to saying that $F$ has no common points with the union $A \cup A'$, that is, $\mathcal{F}^A \cap \mathcal{F}^{A'} = \mathcal{F}^{A \cup A'}$. Thus, the finite intersections which form the basis of the hit-or-miss topology (or, equivalently, the Lawson topology) have the form

$$\mathcal{F}^K_{G_1, \dots, G_n} \stackrel{\text{def}}{=} \{F \in \mathcal{F} : F \cap K = \emptyset,\ F \cap G_i \neq \emptyset,\ i = 1, 2, \dots, n\}$$

for all possible $K \in \mathcal{K}$ and $G_i \in \mathcal{O}$.
Let us consider the following subset of this basis:

$$\mathcal{T} = \left\{ \mathcal{F}^{\overline{D}_1 \cup \dots \cup \overline{D}_k}_{B_1, \dots, B_m} :\ k, m \ge 0;\ B_i, D_j \in \mathcal{B} \right\}.$$

Clearly, $\mathcal{T}$ is countable. We need to verify that $\mathcal{T}$ is a basis for the Lawson topology. For this, we need to prove that for every element $F \in \mathcal{F}$ and every open neighborhood of $F$, there is an open set from $\mathcal{T}$ that contains $F$ and is contained in this neighborhood. It is sufficient to prove this property only for neighborhoods from the known basis.
Indeed, let $F \in \mathcal{F}$ and let the open set $\mathcal{F}^K_{G_1, \dots, G_n}$ be a neighborhood of $F$. We need to find a $U \in \mathcal{T}$ such that $F \in U \subseteq \mathcal{F}^K_{G_1, \dots, G_n}$. We will prove this by considering two possible cases: $F = \emptyset$ and $F \neq \emptyset$.

If $F = \emptyset$, then we cannot have $F \cap G_i \neq \emptyset$, so $n$ must be 0, and $\mathcal{F}^K_{G_1, \dots, G_n} = \mathcal{F}^K$. Since $\mathcal{B}$ is a basis and $K$ is compact, we can find sets $D_1, \dots, D_k \in \mathcal{B}$ with $K \subseteq \overline{D}_1 \cup \dots \cup \overline{D}_k$. Clearly, the empty set has no common point with any set, so $F = \emptyset \in \mathcal{F}^{\overline{D}_1 \cup \dots \cup \overline{D}_k}$. If a set has no common points with a larger set, then it has no common points with a subset either; so we conclude that $\mathcal{F}^{\overline{D}_1 \cup \dots \cup \overline{D}_k} \subseteq \mathcal{F}^K$. So, we can take $U = \mathcal{F}^{\overline{D}_1 \cup \dots \cup \overline{D}_k}$ as the desired neighborhood.
If $F \neq \emptyset$, then the fact that $F$ belongs to the neighborhood $\mathcal{F}^K_{G_1, \dots, G_n}$ means that $F \cap K = \emptyset$ and $F \cap G_i \neq \emptyset$ for all $i = 1, \dots, n$. For each $i$, pick a point $x_i \in F \cap G_i$; since $x_i \notin K$ and $\mathcal{B}$ is a basis, we can choose $B_i \in \mathcal{B}$ with $x_i \in B_i$ and $\overline{B}_i \subseteq G_i \setminus K$, so that $F \cap B_i \neq \emptyset$. The compact set $K$ and the closed set $F \cup \bigcup_{i=1}^{n} \overline{B}_i$ are disjoint, and the space $X$ is Hausdorff and locally compact, so we can find two disjoint open sets $A_1$ and $A_2$ which contain these sets: $K \subseteq A_1$ and $F \cup \bigcup_{i=1}^{n} \overline{B}_i \subseteq A_2$.

Since $\mathcal{B}$ is a basis, we can represent the open set $A_1$ as a union of open sets from $\mathcal{B}$: $K \subseteq A_1 = \bigcup_j D_j$ for some $D_j \in \mathcal{B}$ for which $\overline{D}_j \subseteq A_1$. The set $K$ is compact and thus, from this open cover of $K$, we can extract a finite sub-cover, that is, we have:

$$K \subseteq D_1 \cup \dots \cup D_k \subseteq \overline{D}_1 \cup \dots \cup \overline{D}_k \subseteq A_1 \subseteq A_2^c, \quad \text{and} \quad \left( \overline{D}_1 \cup \dots \cup \overline{D}_k \right) \cap A_2 = \emptyset.$$

Thus, for $U \stackrel{\text{def}}{=} \mathcal{F}^{\overline{D}_1 \cup \dots \cup \overline{D}_k}_{B_1, \dots, B_n} \in \mathcal{T}$, we have $F \in U \subseteq \mathcal{F}^K_{G_1, \dots, G_n}$. The theorem is proven.
METRICS ON CLOSED SETS
From theoretical possibility of a metric to a need for an explicit metric. In the previous section, we have shown that for every LCHS $X$ (in particular, for $X = \mathbb{R}^d$), when we define the Lawson topology on the set $\mathcal{F}(X)$ of all closed subsets of $X$, then this set $\mathcal{F}(X)$ becomes metrizable. This means that, in principle, this topology can be generated by a metric.

However, from the application viewpoint, it is not enough to know the theoretical possibility of a metric; it is desirable to describe a specific explicit example of a metric.
Known metric on the class of all closed sets: Hausdorff metric. Intuitively, a metric describes the notion of a distance between two sets. Before we start describing an explicit metric which is compatible with the Lawson topology, let us first describe natural ways to define such a distance.
For points $x, y \in \mathbb{R}^d$, the standard distance $d(x, y)$ can be described in terms of neighborhoods: the distance is the smallest value $\varepsilon > 0$ such that $x$ belongs to the $\varepsilon$-neighborhood of $y$ and $y$ is in the $\varepsilon$-neighborhood of $x$.

It is natural to extend the notion of an $\varepsilon$-neighborhood from points to sets. Namely, a set $A$ is the collection of all its points; thus, the $\varepsilon$-neighborhood $N_\varepsilon(A)$ of a set $A$ is the collection of the $\varepsilon$-neighborhoods of all its points. In other words, we can define:

$$N_\varepsilon(A) \stackrel{\text{def}}{=} \{x \in X : d(x, a) \le \varepsilon \text{ for some } a \in A\}.$$

Now, we can define the distance $d_H(A, B)$ between the two sets as the smallest value $\varepsilon > 0$ for which $A$ is contained in the $\varepsilon$-neighborhood of $B$ and $B$ is contained in the $\varepsilon$-neighborhood of $A$:

$$d_H(A, B) \stackrel{\text{def}}{=} \inf\{\varepsilon > 0 : A \subseteq N_\varepsilon(B) \ \&\ B \subseteq N_\varepsilon(A)\}.$$

This metric is called the Hausdorff metric (after the mathematician who proposed this definition).
Examples. The Hausdorff distance between the one-point set $\{0.3\}$ and the interval $[0, 1]$ is 0.7: indeed, all the elements of the set $\{0.3\}$ are contained in the interval $[0, 1]$, and every element of the interval $[0, 1]$ is contained in the 0.7-vicinity of the point 0.3.

This example can be viewed as a degenerate case of computing the Hausdorff distance between two intervals $[\underline{a}, \overline{a}]$ and $[\underline{b}, \overline{b}]$, namely, the case when $\underline{a} = \overline{a}$. In general, the Hausdorff distance between the two intervals is equal to $\max(|\underline{a} - \underline{b}|, |\overline{a} - \overline{b}|)$.
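For finite sets, the infimum in the definition becomes a maximum of minima, so $d_H$ is directly computable; a small Python sketch (our own illustration, approximating the $\{0.3\}$ versus $[0, 1]$ example on a grid):

```python
# Hausdorff distance between two finite (hence closed and bounded) subsets
# of the real line: max over one set of the distance to the other, both ways.
def hausdorff(A, B):
    d_ab = max(min(abs(a - b) for b in B) for a in A)
    d_ba = max(min(abs(a - b) for a in A) for b in B)
    return max(d_ab, d_ba)

grid = [i / 1000 for i in range(1001)]   # a fine grid on [0, 1]
print(hausdorff([0.3], grid))            # approximately 0.7
```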
Limitations of the known metric. The Hausdorff metric works well for bounded sets in $\mathbb{R}^d$ or, more generally, for subsets of a compact set. However, if we use the Hausdorff metric to describe distances between arbitrary closed sets $A, B \subseteq \mathbb{R}^d$, we often get meaningless infinities: for example, in the plane, the Hausdorff distance between any two non-parallel lines is infinity.
How to overcome these limitations: the notion of compactification. To overcome the above limitation, it is desirable to modify the definition of the Hausdorff metric.

Since the Hausdorff metric works well for compact spaces, a natural idea is to somehow embed the original topological space into a compact set. Such a procedure is known as compactification.

Simplest compactification. In the simplest compactification, known as the Alexandroff (or one-point) compactification, we simply add a single point $\infty$ to the original space $X$. The corresponding topology on $X \cup \{\infty\}$ is defined as follows:

• if a set $S$ does not contain the new point $\infty$, then it is open if and only if it was open in the original topology;
• if a set $S$ contains $\infty$, then it is called open if and only if its complement $S^c$ is a compact subset of the original space $X$.
From points to closed sets. This compactification can be extended to an arbitrary closed set $F$: namely, a set $F$ which was closed in $X$ is not necessarily closed in the new space $X \cup \{\infty\}$, so we have to take the closure $\overline{F}$ of this set.

If we have a metric $d'$ on the compactification, then we can define a distance between closed sets $F, F' \in \mathcal{F}$ as $d'_H(\overline{F}, \overline{F'})$.
Compactification of $\mathbb{R}^d$. For $\mathbb{R}^d$, the one-point compactification can be implemented in a very natural way, via a so-called stereographic projection. Specifically, we can interpret the one-point compactification of the space $\mathbb{R}^d$ as a sphere $S^d \subseteq \mathbb{R}^{d+1}$. The correspondence between the original points of $\mathbb{R}^d$ and $S^d$ is arranged as follows. We place the space $\mathbb{R}^d$ horizontally, and we place the sphere $S^d$ on top of this plane, so that its South pole is on that plane. Then, to find the image $\pi(x)$ of a point $x \in \mathbb{R}^d$ on the sphere, we connect this point $x$ with the North pole of the sphere by a straight line, and take, as $\pi(x)$, the intersection between this line and the sphere. In this manner, we cover all the points on the sphere except for the North pole itself. The North pole corresponds to infinity, in the sense that if $x \to \infty$, then $\pi(x)$ tends to the North pole.

In this sense, $S^d$ is a one-point compactification of the original space $\mathbb{R}^d$: it is a compact space, and it is obtained by adding a point (the North pole) to (the image of) $\mathbb{R}^d$.
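For $d = 1$, the projection has a simple closed form; the following Python sketch (our own parametrization: a circle of diameter 1 sitting on the real line, with the South pole at the origin and the North pole at $(0, 1)$) computes $\pi(x)$ and the induced metric:

```python
import math

# Stereographic projection of the real line onto the circle of diameter 1:
# the line through (x, 0) and the North pole (0, 1) meets the circle at
# pi(x) = (x*t, 1 - t) with t = 1 / (x^2 + 1).
def pi(x):
    t = 1.0 / (x * x + 1.0)
    return (x * t, 1.0 - t)

def d_prime(x, y):
    """Metric on R induced by the Euclidean distance on the circle."""
    return math.dist(pi(x), pi(y))

print(pi(0.0))            # (0.0, 0.0): the South pole
print(d_prime(1e6, 1e9))  # tiny: both points lie near the North pole,
                          # so the compactified metric avoids infinities
```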
From points to closed sets. By using this construction, we can naturally find the projection of an arbitrary closed set $F \in \mathcal{F}(\mathbb{R}^d)$ as the collection of the projections of all its points, that is, as $\pi(F) = \{\pi(x) : x \in F\}$. The only minor problem is that even though we started with a closed set $F$, this collection $\pi(F)$, by itself, may not be closed. For example, a straight line on a plane is closed, but its projection is not closed, since it has points arbitrarily close to the North pole but does not contain the North pole itself.

To resolve this problem, we can then take the closure of this image. Thus, we arrive at the following stereographic Hausdorff metric: $d_H(\overline{\pi(F)}, \overline{\pi(F')})$
where |I| denotes the cardinality of the set I.

It turns out that under these conditions, there is indeed such a measure.

Theorem (Choquet). Let T : K(X) → R. Then the following two statements are equivalent to each other:

there exists a probability measure Q on σ(F) for which $Q(\mathcal{F}_K) = T(K)$ for all $K \in K(X)$, where $\mathcal{F}_K = \{F \in F(X) : F \cap K \neq \emptyset\}$;

T satisfies the conditions (i), (ii) and (iii).

If one of these statements is satisfied, then the corresponding probability measure Q is uniquely determined by T.

For a proof, see Matheron (1975).
This theorem is the counterpart of the Lebesgue-Stieltjes theorem characterizing probability measures on the Borel σ-field of $R^d$ in terms of multivariate distribution functions of random vectors. For subsequent developments of random closed sets (RCS), see, for example, Nguyen (2006).
FROM RANDOM SETS TO RANDOM
FUZZY SETS
Need for fuzziness. As stated in the Introduction,
random set observations are coarse data contain-
ing the true, unobservable outcomes of random
experiments on phenomena.
A more general type of random and imprecise observations occurs when we have to use natural language to describe our perception. For example, "the risk is high" is an observation containing imprecision at a higher level, due to the fuzziness of our natural language. Modeling fuzziness in natural language, that is, modeling the meaning of its terms, is crucial if we wish to process information of this type.

How we can describe fuzziness. Following Zadeh, we use the theory of fuzzy sets to model the meaning of natural-language terms; see, for example, Nguyen and Walker (2006) for fuzzy sets and fuzzy logics.
The theory is in fact valid for any LCHS space X. For concreteness, one can keep in mind the example $X = R^d$. By a fuzzy subset of X we mean a function $f : X \to [0,1]$, where f(x) is the degree to which the element x is compatible with the fuzzy concept represented by f. This function f is also called a membership function.

For example, the membership function f of the fuzzy concept A = "small non-negative numbers" of R could be f(x) = 0 if x < 0 and $f(x) = e^{-x}$ for x ≥ 0, where $e^{-x}$ is the degree to which x is considered a small non-negative number.

Ordinary subsets of X are special cases of fuzzy subsets of X: for $A \subseteq X$, its indicator function $I_A : X \to \{0,1\} \subseteq [0,1]$, with $I_A(x) = 1$ if $x \in A$ and $I_A(x) = 0$ if $x \notin A$, is the membership function of A.
An alternative way to describe fuzzy sets: α-cuts. Informally, the fact that the actual (unknown) value $x_{act} \in X$ satisfies the property described by a fuzzy set f means the following:

with confidence 1, the actual value $x_{act}$ satisfies the condition $f(x_{act}) > 0$;

with confidence 0.9, the actual value $x_{act}$ satisfies the condition $f(x_{act}) > 1 - 0.9 = 0.1$, that is, belongs to the set $\{x : f(x) \geq 0.1\}$;

and so on.

In view of this interpretation, instead of describing the fuzzy set by its membership function f(x), we can alternatively describe it by the corresponding sets $A_\alpha \overset{def}{=} \{x \in X : f(x) \geq \alpha\}$ for different $\alpha \in [0,1]$. These sets are called alpha-cuts (α-cuts).
Alpha-cuts are nested in the sense that if α < α', then $A_{\alpha'} \subseteq A_\alpha$. So, a fuzzy set can be viewed as a nested family of sets (corresponding to different degrees of confidence); see, for example, Klir and Yuan (1995), Nguyen and Kreinovich (1996), and Nguyen and Walker (2006).
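For a discretized membership function, α-cuts are straightforward to compute. The following Python sketch (our illustration; the grid is an assumption) extracts α-cuts of the "small non-negative numbers" example above and exhibits the nesting property:

```python
import numpy as np

xs = np.linspace(-1.0, 10.0, 1101)       # assumed discretization grid
f = np.where(xs < 0, 0.0, np.exp(-xs))   # membership of "small non-negative"

def alpha_cut(alpha):
    """Discretized alpha-cut A_alpha = {x : f(x) >= alpha}."""
    return xs[f >= alpha]

a_05, a_01 = alpha_cut(0.5), alpha_cut(0.1)
print(a_05.min(), a_05.max())            # roughly [0, ln 2]
print(set(a_05).issubset(set(a_01)))     # True: the cuts are nested
```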
Fuzzy analog of closed sets. In our description
of random sets, we limited ourselves to closed
sets. Since a fuzzy set can be viewed as a nested
family of sets (its alpha-cuts), it is reasonable to
consider fuzzy sets in which all alpha-cuts are
closed sets.
In other words, it is reasonable to restrict ourselves to fuzzy sets $f : X \to [0,1]$ which have the following property: for every $\alpha \in R$, the set $A_\alpha = \{x \in X : f(x) \geq \alpha\}$ is a closed subset of X.

In mathematics, functions f with this property are called upper semicontinuous (usc, for short). We will thus call a fuzzy set a fuzzy closed set if its membership function is usc.
Closed sets are a particular case of fuzzy
closed sets. We have already mentioned that tra-
ditional (crisp) sets are a particular case of fuzzy
sets. It is easy to check that closed crisp sets are
also closed as fuzzy sets.
Moreover, a subset of X is closed if and only if
its indicator function is upper semicontinuous.
Towards a definition of a random fuzzy
closed set. To formalize the concept of random
fuzzy (closed) sets, we will therefore consider the
space USC(X) (where X is a LCHS space), of all
usc functions on X with values in [0,1].
As we mentioned earlier, a natural way to
define a notion of the random fuzzy closed set
is to describe a topology on the set USC(X).
Similar to the case of closed sets, we will use the
Lawson topology generated by a natural order
on USC(X).
The space USC(X) has a natural pointwise order: f ≤ g if and only if f(x) ≤ g(x) for all $x \in X$. We can also consider the dual order f ≥ g (which is equivalent to g ≤ f). We will now look for the corresponding Lawson topology.
THE CONTINUOUS LATTICE OF
UPPER SEMICONTINUOUS
FUNCTIONS
First try: (USC(X), ≤). Let X be any LCHS space. By USC(X) we mean the set of all usc functions $f : X \to [0,1]$.

With the order relation ≤, USC(X) is a complete lattice, where

$\Big(\bigwedge_{j \in J} f_j\Big)(x) = \inf\{f_j(x) : j \in J\}$, and

$\Big(\bigvee_{j \in J} f_j\Big)(x) = \sup\{\alpha \in [0,1] : x \in A_\alpha\}$,

where $A_\alpha$ = closure of $\bigcup_{j \in J}\{y \in X : f_j(y) \geq \alpha\}$.
Remarks

(i) Clearly $\inf\{f_j : j \in J,\ f_j \in USC(X)\} \in USC(X)$. Indeed, let $\alpha \in [0,1]$; then

$\Big\{x \in X : \inf_{j \in J} f_j(x) < \alpha\Big\} = \bigcup_{j \in J}\{x \in X : f_j(x) < \alpha\}$

is an open set.

(ii) Let us explain why we cannot simply use the pointwise supremum to define ∨. Indeed, for $f_n(x) = I_{[1/(n+1),\,1]}(x)$, n ≥ 1, we have $f_n \in USC(X)$, but the pointwise supremum $\sup\{f_n(x) : n \geq 1\} = I_{(0,1]}(x) \notin USC(X)$.
To define $\bigvee\{f_j : j \in J\}$, we proceed as follows. For $\alpha \in [0,1]$, let

$A_\alpha$ = closure of $\bigcup_{j \in J}\{y \in X : f_j(y) \geq \alpha\}$,

and let $f(x) = \sup\{\alpha \in [0,1] : x \in A_\alpha\}$. As such, $f \in USC(X)$.

Next, for every $x \in X$ and every j, we have $\{y : f_j(y) \geq \alpha\} \subseteq A_\alpha$, hence $f_j(x) \leq f(x)$, so f is an upper bound of the family. Moreover, f is the least upper bound: for any y for which $f_j(y) \geq \alpha$ for some j, we have $g(y) \geq \alpha$ whenever $g \geq f_j$ for all $j \in J$. Thus $y \in A_\alpha(g) = \{y : g(y) \geq \alpha\}$, implying that $A_\alpha \subseteq A_\alpha(g)$, since $A_\alpha(g)$ is closed. But then

$\sup\{\alpha \in [0,1] : x \in A_\alpha\} \leq \sup\{\alpha \in [0,1] : x \in A_\alpha(g)\} = g(x)$,

and hence f ≤ g.
(iii) However, the complete lattice (USC(X), ≤) is not continuous.

The proof of this statement is similar to the proof that the lattice (F(X), ⊆) is not continuous. Indeed, let $f : X \to [0,1]$ be the function for which f(x) = 1 for all $x \in X$. Let us then show that the zero function, that is, the function g for which g(x) = 0 for all $x \in X$, is the only function in USC(X) which is way below f.

It suffices to show that for any real number r > 0, the function $r \cdot I_{\{y\}}$ (e.g., $\frac{1}{2} \cdot I_{\{0\}}$) is not way below f.

Let (for concreteness, for X = R and a point $y \in R$)

$f_n(x) = I_{(-\infty,\,y - 1/n) \cup (y + 1/n,\,+\infty)}(x);$

then $\bigvee_{n \geq 1} f_n = 1$, but for any K, we have

$\Big(\bigvee_{n=1}^{K} f_n\Big)(y) = 0,$

thus $r \cdot I_{\{y\}} \not\leq \bigvee_{n=1}^{K} f_n$, implying that $r \cdot I_{\{y\}}$ is not way below f.
Correct description: (USC(X), ≥). Thus, as in the case of F(X), we should consider the reverse order ≥.

Theorem 3. The complete lattice (USC(X), ≥), where $\bigvee\{f_j : j \in J\} = \inf\{f_j : j \in J\}$ and $\bigwedge\{f_j : j \in J\} = h$ with

$h(x) = \sup\{\alpha \in [0,1] : x \in A_\alpha\}$ and $A_\alpha$ = closure of $\bigcup_{j \in J}\{y \in X : f_j(y) \geq \alpha\}$,

is continuous.
Before proving this theorem, we need a representation for elements of USC(X). For every real number r and for every compact set $K \in K(X)$, let us define an auxiliary function $g_{r,K}$ as follows: $g_{r,K}(x) = r$ if $x \in K$, and $g_{r,K}(x) = 1$ otherwise.

Lemma 3. Any $f \in USC(X)$ can be written as

$f = \inf\{g_{r,K} : f(y) < r \text{ for all } y \in K\},$

where the infimum is taken over all pairs $r \in [0,1]$ and $K \in K(X)$ for which f(y) < r for all $y \in K$.

Comment. In this definition, the infimum of an empty set is assumed to be equal to 1.
Proof. Let $x \in X$. Let us consider two possible cases: f(x) = 1 and f(x) < 1.

(i) Let us first consider the case when f(x) = 1. To prove the formula for this case, we will consider two subcases: $x \notin K$ and $x \in K$.

In the first subcase, when $x \notin K$, we have $g_{r,K}(x) = 1$ by definition of the function $g_{r,K}$. Thus, the infimum is equal to 1, that is, to f(x).

In the second subcase, when $x \in K$, there is no r for which f(y) < r for all $y \in K$ (since f(x) = 1 and r ≤ 1). Thus, the infimum is also equal to 1 = f(x).
(ii) Let us now consider the case when f(x) < 1. We need to prove that f(x) is the greatest lower bound of the values $g_{r,K}(x)$. Let us first prove that f(x) is a lower bound. For that, we will again consider two subcases: $x \in K$ and $x \notin K$. When $x \in K$, the defining condition on the pair (r, K) gives $f(x) < r = g_{r,K}(x)$; when $x \notin K$, we have $f(x) < 1 = g_{r,K}(x)$. Thus, in both subcases, f(x) is indeed a lower bound of $\{g_{r,K}(x)\}$.

Let us now prove that f(x) is the greatest lower bound. In other words, let us prove that for any ε > 0, the value f(x) + ε is not a lower bound of $\{g_{r,K}(x)\}$.
Indeed, let $r_0 = f(x) + \varepsilon/2$; then x belongs to the open set $\{y \in X : f(y) < r_0 = f(x) + \varepsilon/2\}$. By local compactness of X, we conclude that there is a compact set $K_0 \subseteq \{y : f(y) < r_0\}$ that contains x. For the pair $(r_0, K_0)$, the function $g_{r_0,K_0}$ occurs in the infimum, and $g_{r_0,K_0}(x) = r_0 = f(x) + \varepsilon/2 < f(x) + \varepsilon$. Thus, f(x) + ε is indeed not a lower bound. The lemma is proven.

Proof of Theorem 3. In view of Lemma 3, continuity follows once we show that every auxiliary function $g_{r,K}$ with f(y) < r for all $y \in K$ is way below f, so that (see Box 1)

$\inf\{g \in USC(X) : g \ll f\} = \inf\{g_{r,K} : f(y) < r \text{ for all } y \in K\} = f.$

Box 1.

So let F be a directed family in (USC(X), ≥) whose least upper bound dominates f, that is, for which $\inf F \leq f$; since f(y) < r for all $y \in K$, we have $\inf F < r$ on K. For every $h \in F$, consider the compact set $K_h \overset{def}{=} \{x \in K : h(x) \geq r\}$. We claim that the intersection of all these sets $K_h$ is the empty set. Indeed, if $x \in \bigcap_{h \in F} K_h$, then, by definition of $K_h$, it means that $r \leq h(x)$ for all $h \in F$. This would imply that $r \leq \inf F$ at this point of K, contradicting the fact that $\inf F < r$ on K.
Since every set $K_h$ is closed, its complement $K_h^c$ is open. From $\bigcap_{h \in F} K_h = \emptyset$, we conclude that $\bigcup_{h \in F} K_h^c = X$ and thus, $K \subseteq \bigcup_{h \in F} K_h^c$. Since K is a compact set, from this open cover, we can extract a finite subcover, hence $K \subseteq \bigcup_{i=1}^{n} K_{h_i}^c$ for some functions $h_i$. Thus, we have $K \cap \bigcap_{i=1}^{n} K_{h_i} = \emptyset$. Since $K_h \subseteq K$ for all h, we thus conclude that $\bigcap_{i=1}^{n} K_{h_i} = \emptyset$.
Since F is directed, we have $h \overset{def}{=} \inf\{h_i : i = 1, \ldots, n\} \in F$. For this h, we have $K_h = \bigcap_{i=1}^{n} K_{h_i} = \emptyset$. This means that for every $x \in K$, we have h(x) < r.

Now, for an arbitrary point $x \in X$, we have two possibilities: $x \in K$ and $x \notin K$. If $x \in K$, then $h(x) < r = g_{r,K}(x)$. On the other hand, if $x \notin K$, then $h(x) \leq 1$ and $g_{r,K}(x) = 1$, hence also $h(x) \leq g_{r,K}(x)$. So, for all $x \in X$, we have $h(x) \leq g_{r,K}(x)$, that is, h dominates $g_{r,K}$ in the reversed order ≥; hence $g_{r,K}$ is indeed way below f. The statement is proven.
Towards an application-oriented description of the corresponding Lawson topology. Since (USC(X), ≥) is a continuous lattice, we can define the corresponding Lawson topology.
The Lawson topology is defined in very general, very abstract terms. To be able to efficiently apply this abstractly defined topology to random fuzzy closed sets, it is desirable to describe the Lawson topology for such sets in easier-to-understand terms. Such a description is given by the following result.
Theorem 4. The following sets form a subbase for the Lawson topology on (USC(X), ≥): the sets of the form $\{f : f(y) < r \text{ for all } y \in K\}$ for all $r \in (0,1]$ and $K \in K(X)$, and the sets $\{f : g(x) < f(x) \text{ for some } x \in X\}$ for all $g \in USC(X)$.
Proof. By definition, the Lawson topology on (USC(X), ≥) is generated by the subbase consisting of the sets $\{f : g \ll f\}$ and $\{f : g \not\geq f\}$. For the ordered set (USC(X), ≥), clearly,

$\{f : g \not\geq f\} = \{f : g(x) < f(x) \text{ for some } x \in X\}.$

Let us now consider the sets $\{f : g \ll f\}$. It is known that these sets form a base of a topology which is called the Scott topology; see, for example, Gierz et al. (1980). A Scott open set is thus an arbitrary union of sets of the type $\{f : g \ll f\}$ with different g. So, to prove our theorem, it is sufficient to prove that the sets $\{f : f(y) < r \text{ for all } y \in K\}$ form a subbase of the Scott topology.
To prove this, we first prove that each such set is indeed a Scott open set; this is proved in the Lemma below. It then remains to verify that the above sets indeed form a subbase for the Scott topology.
Indeed, let A be a Scott open set which contains an element h. Since $A = \bigcup_g \{f : g \ll f\}$, we have $h \in \{f : g \ll f\}$ for some g, that is, $g \ll h$. It is easy to show that h is the directed infimum of the auxiliary functions $g_{r,K}$ with h(y) < r for all $y \in K$ (see Box 2).

$h = \bigwedge\{g_{r,K} : h(y) < r \text{ for all } y \in K\} = \inf\{g_{r,K} : h(y) < r \text{ for all } y \in K\}.$

Box 2.

$\bigcap_{i=1}^{n}\{f \in USC(X) : f(y) < r_i \text{ for all } y \in K_i\} \subseteq \{f \in USC(X) : g \ll f\}.$

Box 3.
Thus, by definition of the way-below relation, $g \ll h$ implies that

$g \geq \inf\{g_{r_i,K_i} : h(y) < r_i \text{ for all } y \in K_i,\ i = 1, 2, \ldots, n\}$

for some finite collection of pairs $(r_i, K_i)$, and hence

$h \in \bigcap_{i=1}^{n}\{f : f(y) < r_i \text{ for all } y \in K_i\}.$

On the other hand, if f satisfies $f(y) < r_i$ for all $y \in K_i$ and all i, then, as shown in the proof of Theorem 3, each $g_{r_i,K_i}$ is way below f; hence $\inf\{g_{r_i,K_i} : i = 1, \ldots, n\} \ll f$, and therefore $g \ll f$. This is exactly the inclusion displayed in Box 3. Thus, h belongs to a finite intersection of sets of the proposed subbase which is contained in $\{f : g \ll f\} \subseteq A$.

Lemma. Each set $U(r,K) = \{f \in USC(X) : f(y) < r \text{ for all } y \in K\}$ is Scott open.

Indeed, U(r,K) is clearly an upper set with respect to the order ≥. Moreover, if a directed family F satisfies $\bigvee F = \inf F \in U(r,K)$, then the compactness argument from the proof of Theorem 3 (applied to the sets $K_h = \{y \in K : h(y) \geq r\}$) yields an element $h \in F$ with h(y) < r for all $y \in K$, that is, $h \in U(r,K)$. The lemma is proven, and so is the theorem.
METRICS AND CHOQUET
THEOREM FOR RANDOM FUZZY
SETS
Resulting formal definition of a random fuzzy set. As we have mentioned, in general, a random object on a probability space (Ω, A, P) is defined as a mapping from Ω to O which is A-B-measurable with respect to an appropriate σ-field B of subsets of the set O of objects.

For closed fuzzy sets, the set O is the set USC(X) of all upper semicontinuous functions, and the appropriate σ-field is the σ-field L(X) of all Borel sets in the Lawson topology.

Thus, we can define a random fuzzy (closed) set S on a probability space (Ω, A, P) as a map $S : \Omega \to USC(X)$ which is A-L(X)-measurable.
Properties of the corresponding Lawson
topology. From the general theory of continuous
lattices, we can conclude that the space USC(X)
is a compact and Hausdorff topological space.
When X is a LCHS space, then, similarly to the
case of the set of all (crisp) closed sets F(X), we
can prove that the set of all fuzzy closed sets
USC(X) is also second countable (see the proof
below) and thus, metrizable.
Later in this section, we will discuss different
metrics on USC(X) which are compatible with
this topology, and the corresponding Choquet
theorem.
Theorem 5. The topological space USC(X)
with the Lawson topology is second countable.
Proof. It is known that for every continuous
lattice, the Lawson topology is second countable
if and only if the Scott topology has a countable
base (Gierz et al., 1980). Thus, to prove our result,
it is sufficient to prove that the Scott topology has
a countable base.
In view of the results described in the previous section, the Scott topology has a base consisting of sets of the form

$\{f \in USC(X) : f(y) < r_i \text{ for all } y \in K_i,\ i = 1, \ldots, n\},$

where $r_i \in (0,1]$, $K_i \in K(X)$, and n ≥ 0. Let us denote

$U(r,K) \overset{def}{=} \{f : f(y) < r \text{ for all } y \in K\}.$

In these terms, the base of the Scott topology consists of the finite intersections $\bigcap_{i=1}^{n} U(r_i, K_i)$.
Recall that since X is LCHS, X is normal, and there is a countable base $\mathcal{B}$ of the topological space (X, O) such that for every $B \in \mathcal{B}$, the closure $\overline{B}$ is compact, and for any open set $G \in O$, we have $G = \bigcup_{j \in J} B_j$ for some $B_j \in \mathcal{B}$ with $\overline{B}_j \subseteq G$.

Our claim is that a countable base of USC(X) consists of sets of the form

$\bigcap_{i=1}^{n} U\Big(q_i,\ \bigcup_{j=1}^{m_i} \overline{B}_{ij}\Big),$

where $q_i$ are rational numbers from the interval (0,1], and $B_{ij} \in \mathcal{B}$.
It suffices to show that every neighborhood $\bigcap_{i=1}^{n} U(r_i, K_i)$ of a closed fuzzy set $f \in USC(X)$ contains a sub-neighborhood $\bigcap_{i=1}^{n} U\big(q_i, \bigcup_{j=1}^{m_i} \overline{B}_{ij}\big)$ which still contains f.
Indeed, by definition of the sets $U(r_i, K_i)$, the fact that $f \in \bigcap_{i=1}^{n} U(r_i, K_i)$ means that for every i and for every $y \in K_i$, we have $f(y) < r_i$. Let us denote by $A_i$ the set of all the values $x \in X$ for which $f(x) \geq r_i$: $A_i = \{x \in X : f(x) \geq r_i\}$. Since the function f is upper semicontinuous, the set $A_i$ is closed. The sets $A_i$ and $K_i$ are both closed, and clearly $A_i \cap K_i = \emptyset$. Since the space X is normal, there exists an open set $G_i$ that separates $A_i$ and $K_i$, that is, for which $K_i \subseteq G_i$ and $A_i \cap G_i = \emptyset$. Due to the property of the base, we have $G_i = \bigcup_{j \in J} B_{ij}$ with $\overline{B}_{ij} \subseteq G_i$. Since $K_i$ is compact, finitely many of these base sets cover it: $K_i \subseteq \bigcup_{j=1}^{m_i} B_{ij}$, while $\big(\bigcup_{j=1}^{m_i} \overline{B}_{ij}\big) \cap A_i = \emptyset$, implying that for every $y \in \bigcup_{j=1}^{m_i} \overline{B}_{ij}$, we have $y \notin A_i$, that is, $f(y) < r_i$.
Since f is usc, it attains its maximum on the compact set $\bigcup_{j=1}^{m_i} \overline{B}_{ij}$ at some point $y_{max}$. For this maximum value, we therefore also have $f(y_{max}) < r_i$ and therefore, there exists a rational number $q_i$ such that $f(y_{max}) < q_i < r_i$. Since the value $f(y_{max})$ is the maximum, we conclude that $f(y) \leq f(y_{max})$ for all other $y \in \bigcup_{j=1}^{m_i} \overline{B}_{ij}$. Thus, for all such y, we have $f(y) < q_i$. This means that

$f \in \bigcap_{i=1}^{n} U\Big(q_i,\ \bigcup_{j=1}^{m_i} \overline{B}_{ij}\Big).$
It is easy to show that

$\bigcap_{i=1}^{n} U\Big(q_i,\ \bigcup_{j=1}^{m_i} \overline{B}_{ij}\Big) \subseteq \bigcap_{i=1}^{n} U(r_i, K_i),$

since $K_i \subseteq \bigcup_{j=1}^{m_i} \overline{B}_{ij}$ and $q_i < r_i$. The theorem is proven.
Towards defining metrics on USC(X). We have proven that the Lawson topology on USC(X) is metrizable, that is, that there exists a metric which is compatible with this topology. From the viewpoint of possible applications, it is desirable to give an explicit description of such a metric.

In Section 4, we used the one-point compactification procedure to explicitly describe a specific metric compatible with the Lawson topology on the set F(X) of all closed (crisp) sets. Let us show that for the class USC(X) of closed fuzzy sets, we can define a similar metric if we identify these closed fuzzy sets with their hypographs, that is, informally, the areas below their graphs.

Formally, for every function $f : X \to [0,1]$, its hypograph Hyp(f) is defined as:

$Hyp(f) = \{(x, \alpha) \in X \times [0,1] : f(x) \geq \alpha\}.$
Every hypograph is a subset of X × [0,1]. Since we consider usc functions, each hypograph is closed.

Let HYP(X) denote the set of all hypographs of all functions $f \in USC(X)$. Then $f \mapsto Hyp(f)$ is a bijection from USC(X) to HYP(X); see Nguyen, Wang, and Wei (in press). One can easily check that the set HYP(X) is a closed subset of F(X × [0,1]).

Note that since X is a LCHS space, the set X × [0,1] is also a LCHS space, so the set F(X × [0,1]) of all its closed subsets is a topological space with a canonical Lawson topology. Thus, according to Section 4, F(X × [0,1]) has a compatible metric H. The induced metric on HYP(X) is then also compatible with the induced Lawson topology (or hit-or-miss topology) on HYP(X). The only thing that remains to be proven is that (HYP(X), H) is homeomorphic to (USC(X), L), where L denotes the Lawson topology on USC(X). This result was proven in Nguyen and Tran (2007).
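In computations, this hypograph metric can be approximated on a grid. The sketch below is our illustration only: the grid, the two triangular membership functions, and the use of the plain Euclidean metric in place of the stereographic one are simplifying assumptions that are reasonable for fuzzy sets supported on a bounded region.

```python
import numpy as np

def hypograph_points(xs, f, n_alpha=21):
    """Discretize Hyp(f) = {(x, a) : 0 <= a <= f(x)} as a finite point cloud."""
    alphas = np.linspace(0.0, 1.0, n_alpha)
    pts = [(x, a) for x, fx in zip(xs, f(xs)) for a in alphas if a <= fx]
    return np.array(pts)

def hausdorff(P, Q):
    """Symmetric Hausdorff distance between two finite point clouds."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

xs = np.linspace(0.0, 4.0, 41)
tri = lambda c: lambda x: np.maximum(0.0, 1.0 - np.abs(x - c))  # triangular fuzzy number
print(hausdorff(hypograph_points(xs, tri(1.0)), hypograph_points(xs, tri(2.0))))
```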
Finally, a Choquet theorem for USC(X) can be obtained by embedding USC(X) into F(X × [0,1]) via hypographs and using the Choquet theorem for random closed sets on X × [0,1]. For more details, see Nguyen et al. (in press).
TOWARDS PRACTICAL
APPLICATIONS
From the general theory to computationally
feasible practical applications. In the previous
sections, we described a general theory of random
fuzzy sets, a theory motivated by (and tailored
towards) potential applications. Our main motiva-
tion is to prepare the background for as wide a
range of applications as possible. Because of this
objective, we tried our best to make our theory
as general as possible.
Because of this same application objective, we
also tried our best to make the resulting techniques
as computation-friendly as possible. However, as
we will see in this section, the resulting problems
are computationally difficult even in the simplest cases. Because of this difficulty, in this chapter, we
will mainly concentrate on such simple cases.
Practical need for random sets and random fuzzy sets: reminder. We started this chapter by explaining the practical motivation for random sets and random fuzzy sets.

This motivation is that due to measurement uncertainty (or uncertainty of expert estimates), often, instead of the actual values $x_i$ of the quantities of interest, we only know the intervals $\mathbf{x}_i = [\tilde{x}_i - \Delta_i,\ \tilde{x}_i + \Delta_i]$, where $\tilde{x}_i$ is the known approximate value and $\Delta_i$ is the upper bound on the approximation error (provided, for measurements, by the manufacturer of the measuring instrument).

These intervals can be viewed as random intervals, that is, as samples from an interval-valued random variable. In such situations, instead of the exact value of a sample statistic such as the covariance C[x,y], we can only have an interval $\mathbf{C}[x,y]$ of possible values of this statistic.
The need for such random intervals is well recognized, and there has already been a lot of related research; see, for example, Moeller and Beer (2004). In this approach, the uncertainty in a vector quantity $x = (x_1, \ldots, x_d) \in R^d$ is usually described by describing intervals of possible values of each of its components. This is equivalent to describing the set of all possible values of x as a box (multi-dimensional interval) $[\underline{x}_1, \overline{x}_1] \times \cdots \times [\underline{x}_d, \overline{x}_d]$. However, the resulting data processing problems are often very challenging, and there is still large room for further development.
One such need comes from the fact that uncertainty is often much more complex than intervals. For example, for the case of several variables, instead of a multi-dimensional interval, we may have a more complex set $S \subseteq R^d$. In such a situation, we need a more general theory of random sets.
We also have mentioned that to get a more
adequate description of expert estimates, we need
to supplement the set S of possible values of the
quantity (or quantities) of interest with describing
the sets $S_\alpha$ which contain the values which are possible with a certain degree α. In such situations, we need to consider random fuzzy sets.
What is needed for practical applications: an
outline of this section. As we have just recalled,
there is a practical need to consider random sets
and random fuzzy sets. In order to apply the
corresponding theory, we first need to estimate
the actual distribution of random sets or random
fuzzy sets from the observations.
In other words, we need to develop statistical
techniques for random sets and random fuzzy
sets. In this section, we start with a reminder
about traditional number-valued and vector-valued
statistical techniques, and the need for extending
these techniques to random sets and random fuzzy
sets. Then, we overview the main sources of the
corresponding data uncertainty and techniques
for dealing with the corresponding uncertainty.
This will prepare us for the case study described
in the following section.
Traditional statistics: brief reminder. In traditional statistics, we assume that the observed values are independent identically distributed (i.i.d.) variables $x_1, \ldots, x_n$, and we want to find statistics $C_n(x_1, \ldots, x_n)$ that would approximate the desired parameter C of the corresponding probability distribution.

For example, if we want to estimate the mean E, we can take the arithmetic average

$E_n[x] = \dfrac{x_1 + \cdots + x_n}{n}.$

It is known that as $n \to \infty$, this statistic tends (with probability 1) to the desired mean: $E_n[x] \to E$. Similarly, the sample variance

$V_n[x] = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - E_n[x])^2$

tends to the actual variance V, the sample covariance

$C_n[x,y] = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - E_n[x])(y_i - E_n[y])$

between two different samples tends to the actual covariance C, and so on.
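These estimators translate directly into code; the following Python lines (our illustration) compute the three statistics just defined:

```python
import numpy as np

def sample_stats(x, y):
    """Sample mean, variance, and covariance (with the 1/(n-1) factor)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    E_x, E_y = x.mean(), y.mean()
    V_x = ((x - E_x) ** 2).sum() / (n - 1)
    C_xy = ((x - E_x) * (y - E_y)).sum() / (n - 1)
    return E_x, V_x, C_xy

print(sample_stats([1, 2, 3, 4, 5], [3, 6, 9, 12, 15]))  # (3.0, 2.5, 7.5)
```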
Coarsening: a source of random sets. In traditional statistics, we implicitly assume that the values $x_i$ are directly observable. In real life, due to (inevitable) measurement uncertainty, often, what we actually observe is a set $S_i$ that contains the actual (unknown) value of $x_i$. This phenomenon is called coarsening; see, for example, Heitjan and Rubin (1991). Due to coarsening, instead of the actual values $x_i$, all we know is the sets $X_1, \ldots, X_n$ that are known to contain the actual (un-observable) values $x_i$: $x_i \in X_i$.
Statistics based on coarsening. The sets $X_1, \ldots, X_n$ are i.i.d. random sets. We want to find statistics of these random sets that would enable us to approximate the desired parameters of the original distribution of x. Here, a statistic $S_n(X_1, \ldots, X_n)$ transforms n sets $X_1, \ldots, X_n$ into a new set. We want this statistic $S_n(X_1, \ldots, X_n)$ to tend to a limit set L as $n \to \infty$, and we want this limit set L to contain the value of the desired parameter of the original distribution.

For example, if we are interested in the mean E[x], then we can take $S_n = (X_1 + \cdots + X_n)/n$ (where the sum is the Minkowski, element-wise, sum of the sets). It is possible to show that, under reasonable assumptions, this statistic tends to a limit L, and that $E[x] \in L$. This limit can be viewed, therefore, as a set-based average of the sets $X_1, \ldots, X_n$.
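For intervals, the Minkowski sum is computed endpoint-wise, so the set-valued average is easy to implement. Here is a small Python sketch (our illustration, with simulated coarse observations):

```python
import numpy as np

rng = np.random.default_rng(0)

def minkowski_average(intervals):
    """Minkowski average (1/n)(X_1 + ... + X_n) of intervals X_i = [lo_i, hi_i]:
    for intervals this is simply the interval of the endpoint averages."""
    los, his = zip(*intervals)
    return (np.mean(los), np.mean(his))

# coarse observations: each X_i = [x_i - delta_i, x_i + delta_i] contains x_i
x = rng.normal(loc=5.0, scale=1.0, size=1000)
delta = rng.uniform(0.1, 0.5, size=1000)
X = list(zip(x - delta, x + delta))
print(minkowski_average(X))  # an interval that contains E[x] = 5
```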
Important issue: computational complexity.
There has been a lot of interesting theoretical
research on set-valued random variables and
corresponding statistics. In many cases, the cor-
responding statistics have been designed, and
their asymptotical properties have been proven;
see, for example, Goutsias, Mahler, and Nguyen
(1997); Li, Ogura, and Kreinovich (2002); Nguyen
(2006); and references therein.
In many such situations, the main obstacle to a practical use of these statistics is that going from random numbers to random sets drastically increases the computational complexity, and hence the running time, of the required computations. It is therefore desirable to come up with new, faster algorithms for computing such set-valued statistics.
Sources of uncertainty: general reminder. Traditional engineering statistical formulas assume that we know the exact values $x_i$ of the corresponding quantity. In practice, these values come either from measurements or from expert estimates. In both cases, we get only approximations $\tilde{x}_i$ to the actual (unknown) values $x_i$.

When we use these approximate values $\tilde{x}_i \approx x_i$ to compute the desired statistical characteristics such as E and V, we only get approximate values $\tilde{E}$ and $\tilde{V}$ for these characteristics. It is desirable to estimate the accuracy of these approximations.
Case of measurement uncertainty. Measurements are never 100 percent accurate. As a result, the result $\tilde{x}_i$ of the measurement is, in general, different from the (unknown) actual value $x_i$ of the desired quantity. The difference

$\Delta x_i \overset{def}{=} \tilde{x}_i - x_i$

between the measured and the actual values is usually called a measurement error.

The manufacturers of a measuring device usually provide us with an upper bound $\Delta_i$ for the (absolute value of) possible errors, that is, with a bound for which $|\Delta x_i| \leq \Delta_i$.
Case of expert uncertainty. An expert usually describes his/her uncertainty by using words from the natural language, like "most probably, the value of the quantity is between 6 and 7, but it is somewhat possible to have values between 5 and 8." To formalize this knowledge, it is natural to use fuzzy set theory, a formalism specifically designed for describing this type of informal (fuzzy) knowledge (Klir & Yuan, 1995; Nguyen & Walker, 2006).

As a result, for every value $x_i$, we have a fuzzy set $\mu_i(x_i)$ which describes the expert's prior knowledge about $x_i$: the number $\mu_i(x_i)$ describes the expert's degree of certainty that $x_i$ is a possible value of the i-th quantity.

As we have mentioned earlier, an alternative user-friendly way to represent a fuzzy set is by using its α-cuts $\{x_i : \mu_i(x_i) \geq \alpha\}$. In these terms, a fuzzy set can be viewed as a nested family of intervals $[\underline{x}_i(\alpha), \overline{x}_i(\alpha)]$ corresponding to different levels α.
Estimating statistics under fuzzy uncertainty: precise formulation of the problem. In general, we have fuzzy knowledge $\mu_i(x_i)$ about each value $x_i$; we want to find the fuzzy set corresponding to a given characteristic $y = C(x_1, \ldots, x_n)$. Intuitively, the value y is a reasonable value of the characteristic if $y = C(x_1, \ldots, x_n)$ for some reasonable values $x_i$, that is, if for some values $x_1, \ldots, x_n$, $x_1$ is reasonable, and $x_2$ is reasonable, ..., and $y = C(x_1, \ldots, x_n)$. If we interpret "and" as min and "for some" ("or") as max, then we conclude that the corresponding degree of certainty μ(y) in y is equal to

$\mu(y) = \max\{\min(\mu_1(x_1), \ldots, \mu_n(x_n)) : C(x_1, \ldots, x_n) = y\}.$
Reduction to the case of interval uncertainty. It is known that the above formula (called the extension principle) can be reformulated as follows: for each α, the α-cut y(α) of y is equal to the range of possible values of $C(x_1, \ldots, x_n)$ when $x_i \in \mathbf{x}_i(\alpha)$ for all i. Thus, from the computational viewpoint, the problem of computing a statistical characteristic under fuzzy uncertainty can be reduced to the problem of computing this characteristic under interval uncertainty; see, for example, Dubois, Fargier, and Fortin (2005).

In view of this reduction, in the following text, we will consider the case of interval (and set) uncertainty.
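This reduction is directly usable in code: process each α-cut as an interval problem and stack the resulting ranges back into a fuzzy number. The sketch below is our illustration; triangular inputs and the sample mean as the characteristic C are assumptions made for simplicity (the mean is monotone in each argument, so its range is attained at the interval endpoints):

```python
import numpy as np

def tri_cut(a, b, c, alpha):
    """Alpha-cut [a + alpha(b-a), c - alpha(c-b)] of a triangular fuzzy number (a, b, c)."""
    return (a + alpha * (b - a), c - alpha * (c - b))

def fuzzy_mean(tris, alphas=np.linspace(0, 1, 11)):
    """Fuzzy value of the mean: one interval (range of the mean) per alpha level."""
    cuts = {}
    for al in alphas:
        los, his = zip(*(tri_cut(a, b, c, al) for (a, b, c) in tris))
        cuts[round(al, 2)] = (np.mean(los), np.mean(his))  # monotonicity: endpoints suffice
    return cuts

# three expert estimates given as triangular fuzzy numbers (a, b, c)
print(fuzzy_mean([(5, 6.5, 8), (4, 5, 6), (6, 7, 9)]))
```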
Estimating statistics under interval uncertainty: a problem. In the case of interval uncertainty, instead of the true values $x_1, \ldots, x_n$, we only know the intervals $\mathbf{x}_1 = [\underline{x}_1, \overline{x}_1], \ldots, \mathbf{x}_n = [\underline{x}_n, \overline{x}_n]$ that contain the (unknown) true values of the measured quantities. For different values $x_i \in \mathbf{x}_i$, we get, in general, different values of the corresponding statistical characteristic $C(x_1, \ldots, x_n)$. Since all values $x_i \in \mathbf{x}_i$ are possible, we conclude that all the values $C(x_1, \ldots, x_n)$ corresponding to $x_i \in \mathbf{x}_i$ are possible estimates for the corresponding statistical characteristic. Therefore, for the interval data $\mathbf{x}_1, \ldots, \mathbf{x}_n$, a reasonable estimate for the corresponding statistical characteristic is the range

$\mathbf{C}(\mathbf{x}_1, \ldots, \mathbf{x}_n) \overset{def}{=} \{C(x_1, \ldots, x_n) : x_1 \in \mathbf{x}_1, \ldots, x_n \in \mathbf{x}_n\}.$

We must therefore modify the existing statistical algorithms so that they compute, or bound, these ranges.
Estimating mean under interval uncertainty. The arithmetic average E is a monotonically increasing function of each of its n variables $x_1, \ldots, x_n$, so its range is attained at the endpoints of the input intervals:

$\mathbf{E} = [\underline{E}, \overline{E}],\ \text{where}\ \underline{E} = \dfrac{\underline{x}_1 + \cdots + \underline{x}_n}{n}\ \text{and}\ \overline{E} = \dfrac{\overline{x}_1 + \cdots + \overline{x}_n}{n}.$

For the variance, the situation is more complex: computing the exact range is, in general, NP-hard. An efficient algorithm is known for the case when none of the "narrowed" intervals $[x_i^-, x_i^+]$, where

$x_i^- \overset{def}{=} \tilde{x}_i - \dfrac{\Delta_i}{n}\ \text{and}\ x_i^+ \overset{def}{=} \tilde{x}_i + \dfrac{\Delta_i}{n},$

are proper subsets of one another, that is, when $[x_i^-, x_i^+] \not\subseteq (x_j^-, x_j^+)$ for all i and j (Dantsin et al., 2006).
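A minimal sketch of the mean's interval range (our illustration), using the endpoint formula above:

```python
def mean_range(intervals):
    """Exact range of the arithmetic average over a box of intervals:
    by monotonicity, the bounds are the means of the endpoints."""
    los, his = zip(*intervals)
    n = len(intervals)
    return (sum(los) / n, sum(his) / n)

print(mean_range([(0.7, 0.8), (0.5, 0.9), (0.6, 0.7)]))  # (0.6, 0.8)
```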
What can be done if we cannot effectively compute the exact range. As we have just mentioned, the problem of computing statistical characteristics under interval uncertainty is often NP-hard, which means, crudely speaking, that we cannot efficiently compute the exact range for these characteristics.

A natural solution is as follows: since we cannot compute the exact range, we should try to find an enclosure for this range. Computing the range $\mathbf{C}(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ of a function $C(x_1, \ldots, x_n)$ based on the input intervals $\mathbf{x}_i$ is called interval computations; see, for example, Jaulin, Kieffer, Didrit, and Walter (2001).
Interval computations techniques: brief reminder. Historically the first method for computing an enclosure for the range is the method which is sometimes called "straightforward" interval computations. This method is based on the fact that, inside the computer, every algorithm consists of elementary operations (arithmetic operations, min, max, etc.). For each elementary operation f(a, b), if we know the intervals a and b for a and b, we can compute the exact range f(a, b). The corresponding formulas form the so-called interval arithmetic; see Box 4 for examples.

In straightforward interval computations, we repeat the computations forming the program C for computing $C(x_1, \ldots, x_n)$ step-by-step, replacing each operation with real numbers by the corresponding operation of interval arithmetic. It is known that, as a result, we get an enclosure $Y \supseteq \mathbf{C}(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ for the desired range.

In some cases, this enclosure is exact. In more complex cases (see examples below), the enclosure has excess width.

There exist more sophisticated techniques for producing a narrower enclosure, for example, the centered form method. However, for each of these techniques, there are cases when we get an excess width. (Reason: as we have mentioned, the problem of computing the exact range is known to be NP-hard even for population variance.)

$[\underline{a}, \overline{a}] + [\underline{b}, \overline{b}] = [\underline{a} + \underline{b},\ \overline{a} + \overline{b}];\quad [\underline{a}, \overline{a}] - [\underline{b}, \overline{b}] = [\underline{a} - \overline{b},\ \overline{a} - \underline{b}];$

$[\underline{a}, \overline{a}] \cdot [\underline{b}, \overline{b}] = [\min(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b}),\ \max(\underline{a}\underline{b}, \underline{a}\overline{b}, \overline{a}\underline{b}, \overline{a}\overline{b})].$

Box 4.
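These rules translate into a small interval class. The sketch below (our illustration) implements Box 4 and shows the excess width of straightforward interval computations on the expression x·(1−x), whose true range over x ∈ [0,1] is [0, 0.25]:

```python
class Interval:
    """Closed interval [lo, hi] with the arithmetic of Box 4."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, o):
        return Interval(self.lo + o.lo, self.hi + o.hi)
    def __sub__(self, o):
        return Interval(self.lo - o.hi, self.hi - o.lo)
    def __mul__(self, o):
        p = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(p), max(p))
    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(0.0, 1.0)
one = Interval(1.0, 1.0)
print(x * (one - x))  # [0.0, 1.0]: an enclosure of the true range [0, 0.25], with excess width
```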
CASE STUDY: A BIOINFORMATICS
PROBLEM
In this section, we describe an example of a practical application. This example was first outlined in (Kreinovich et al., 2007) and (Xiang, 2007). For other applications, see (Kreinovich et al., 2006), (Kreinovich et al., 2007) and references therein.

Description of the case study. In cancer research, it is important to find out the genetic difference between the cancer cells and the healthy cells. In the ideal world, we should be able to have a sample of cancer cells, and a sample of healthy cells, and thus directly measure the concentrations c and h of a given gene in cancer and in healthy cells. In reality, it is very difficult to separate the cells, so we have to deal with samples that contain both cancer and normal cells. Let $y_i$ denote the result of measuring the concentration of the gene in the i-th sample, and let $x_i$ denote the percentage of cancer cells in the i-th sample. Then, we should have

$x_i \cdot c + (1 - x_i) \cdot h \approx y_i$

(approximately equal because there are measurement errors in measuring $y_i$).

Let us first consider an idealized case in which we know the exact percentages $x_i$. In this case, we can find the desired values c and h by solving a system of linear equations $x_i \cdot c + (1 - x_i) \cdot h \approx y_i$ with two unknowns c and h.

It is worth mentioning that this system can be somewhat simplified if instead of c, we consider a new variable $a \overset{def}{=} c - h$. In terms of the new unknowns a and h, the system takes the following form:

$a \cdot x_i + h \approx y_i.$
The errors of measuring $y_i$ are normally distributed i.i.d. random variables, so to estimate a and h, we can use the Least Squares Method (LSM):

$\sum_{i=1}^{n}(a \cdot x_i + h - y_i)^2 \to \min_{a,h}.$

According to LSM, we have

$a = \dfrac{C[x,y]}{V[x]}$ and $h = E[y] - a \cdot E[x],$

where $E[x] = \dfrac{1}{n}(x_1 + \cdots + x_n)$, $E[y] = \dfrac{1}{n}(y_1 + \cdots + y_n)$, and $V[x] = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - E[x])^2$. Once we know a = c − h and h, we can then estimate c as a + h.
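A minimal sketch of this estimation (our illustration, with synthetic data) follows the formulas above:

```python
import numpy as np

def estimate_c_h(x, y):
    """Least-squares estimates: a = C[x,y]/V[x], h = E[y] - a E[x], c = a + h."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    h = y.mean() - a * x.mean()
    return a + h, h                                        # (c, h)

rng = np.random.default_rng(1)
x = rng.uniform(0.6, 0.9, 50)                              # cancer-cell fractions
y = x * 2.0 + (1 - x) * 0.5 + rng.normal(0, 0.02, 50)      # true c = 2.0, h = 0.5
print(estimate_c_h(x, y))                                  # approximately (2.0, 0.5)
```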
The problem is that the concentrations $x_i$ come from experts who manually count different cells, and experts can only provide interval bounds on the values $x_i$, such as $x_i \in [0.7, 0.8]$ (or even only fuzzy bounds). Different values of $x_i$ in the corresponding intervals lead to different values of a and h. It is therefore desirable to find the ranges of a and h corresponding to all possible values $x_i \in [\underline{x}_i, \overline{x}_i]$.

Comment. Our motivation for solving this problem comes from bioinformatics, but similar problems appear in various practical situations where measurements with uncertainties are available and statistical data is to be processed.
Linear approximation. Let $\tilde{x}_i = (\underline{x}_i + \overline{x}_i)/2$ be the midpoint of the i-th interval, and let $\Delta_i = (\overline{x}_i - \underline{x}_i)/2$ be its half-width. In the linear approximation, the half-width $\Delta_a$ of the resulting interval for a is

$\Delta_a = \sum_{i=1}^{n}\left|\dfrac{\partial a}{\partial x_i}\right| \cdot \Delta_i,\ \text{where}\ \dfrac{\partial a}{\partial x_i} = \dfrac{1}{(n-1) \cdot V[x]} \cdot (y_i - E[y] - 2a \cdot x_i + 2a \cdot E[x]),$

and similarly

$\Delta_h = \sum_{i=1}^{n}\left|\dfrac{\partial h}{\partial x_i}\right| \cdot \Delta_i.$
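In code, these sensitivities can also be obtained without hand-derived formulas, by numerically differentiating the estimator at the interval midpoints. A sketch (our illustration; the finite-difference step is an arbitrary choice):

```python
import numpy as np

def slope(x, y):
    """a = C[x,y]/V[x] at a given x (with y fixed)."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

def delta_a(x_mid, delta, y, eps=1e-6):
    """Linear-approximation half-width of a: sum_i |da/dx_i| * Delta_i,
    with da/dx_i estimated by central finite differences."""
    total = 0.0
    for i in range(len(x_mid)):
        xp, xm = x_mid.copy(), x_mid.copy()
        xp[i] += eps
        xm[i] -= eps
        total += abs(slope(xp, y) - slope(xm, y)) / (2 * eps) * delta[i]
    return total

x_mid = np.array([0.75, 0.70, 0.85, 0.65])
delta = np.array([0.05, 0.05, 0.05, 0.05])
y = np.array([1.6, 1.5, 1.8, 1.4])
print(delta_a(x_mid, delta, y))
```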
Prior estimation of the resulting accuracy. The above formulas provide us with the accuracy after the data has been processed. It is often desirable to have an estimate prior to measurements, to make sure that we will get c and h with the desired accuracy.

The difference $\Delta y_i$ is a measurement error, so it is normally distributed with 0 mean and standard deviation σ(y) corresponding to the accuracy of measuring $y_i$. The difference $\Delta x_i$ is distributed with 0 mean and standard deviation $\sqrt{V[\Delta x]}$. For estimation purposes, it is reasonable to assume that the values $\Delta x_i$ are also normally distributed. It is also reasonable to assume that the errors in $x_i$ and $y_i$ are uncorrelated, so the linear combination $\Delta y_i - a \cdot \Delta x_i$ is also normally distributed, with 0 mean and variance $\sigma^2(y) + a^2 \cdot V[\Delta x]$. It is also reasonable to assume that all the values $\Delta_i$ are approximately the same: $\Delta_i \approx \Delta$.

For a normal distribution x with 0 mean and standard deviation σ, the mean value of |x| is equal to $\sigma \cdot \sqrt{2/\pi}$. Thus, the absolute value $|\Delta y_i - a \cdot \Delta x_i|$ of the above combination has a mean value $\sqrt{2/\pi} \cdot \sqrt{\sigma^2(y) + a^2 \cdot V[\Delta x]}$. Hence, the expected value of $\Delta_a$ is equal to

$\sqrt{\dfrac{2}{\pi}} \cdot \dfrac{\sqrt{\sigma^2(y) + a^2 \cdot V[\Delta x]}}{V[x]} \cdot \Delta.$

Since measurements are usually more accurate than expert estimates, we have $\sigma^2(y) \ll a^2 \cdot V[\Delta x]$, hence

$\Delta_a \approx \sqrt{\dfrac{2}{\pi}} \cdot a \cdot \dfrac{\sqrt{V[\Delta x]}}{V[x]} \cdot \Delta.$

Similar estimates can be given for $\Delta_h$.
Why not get exact estimates? Because, in general, finding the exact range is NP-hard. Let us show that, in general, finding the exact range for the ratio C[x,y]/V[x] is an NP-hard problem; this proof was first presented in Kreinovich, Longpre, Starks, Xiang, Beck, Kandathi, Nayak, Ferson, and Hajagos (2007).

The proof is similar to the proof that computing the range for the variance is NP-hard (Ferson, Ginzburg, Kreinovich, Longpre, & Aviles, 2005): namely, we reduce the partition problem (known to be NP-hard) to our problem. In the partition problem, we are given m positive integers $s_1, \ldots, s_m$, and we must check whether there exist values $\varepsilon_i \in \{-1, 1\}$ for which $\sum_{i=1}^{m}\varepsilon_i s_i = 0$. We will reduce this problem to the following problem: n = m + 2, $y_1 = \cdots = y_m = 0$, $y_{m+1} = 1$, $y_{m+2} = -1$, $\mathbf{x}_i = [-s_i, s_i]$ for i ≤ m, $\mathbf{x}_{m+1} = 1$, and $\mathbf{x}_{m+2} = -1$. In this case, E[y] = 0, so

$C[x,y] = \dfrac{1}{n-1}\sum_{i=1}^{n} x_i y_i - \dfrac{n}{n-1}\,E[x] \cdot E[y] = \dfrac{2}{m+1}.$

Therefore, $C[x,y]/V[x] \to \min$ if and only if $V[x] \to \max$.
Here,

$V[x] = \dfrac{1}{m+1}\left(\sum_{i=1}^{m} x_i^2 + 2\right) - \dfrac{1}{(m+1)(m+2)}\left(\sum_{i=1}^{m} x_i\right)^2.$

Since $|x_i| \leq s_i$, we always have

$V[x] \leq V_0 \overset{def}{=} \dfrac{1}{m+1}\left(\sum_{i=1}^{m} s_i^2 + 2\right),$

and the only possibility to have $V[x] = V_0$ is when $x_i = \pm s_i$ for all i and $\sum_{i=1}^{m} x_i = 0$. Thus, $V[x] = V_0$ if and only if the original partition problem has a solution. Hence, the smallest possible value

$C[x,y]/V[x] = \dfrac{2}{\sum_{i=1}^{m} s_i^2 + 2}$

is attained if and only if the original instance of the partition problem has a solution.

The reduction is proven, so our problem is indeed NP-hard.
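The reduction can be checked numerically on small instances: brute-forcing the signs $\varepsilon_i$ is exactly brute-forcing the vertices of the box where V[x] is maximized. A small Python sketch (our illustration):

```python
from itertools import product

def min_ratio_bruteforce(s):
    """Smallest C[x,y]/V[x] over the vertices x_i = +/- s_i of the box,
    for the instance y = (0,...,0,1,-1), x_{m+1} = 1, x_{m+2} = -1.
    C[x,y] is constant (= 2/(m+1)) on this instance, so only V[x] varies."""
    m = len(s)
    n = m + 2
    best = float("inf")
    for signs in product((-1, 1), repeat=m):
        xs = [e * si for e, si in zip(signs, s)] + [1, -1]
        ex = sum(xs) / n
        v = sum((xi - ex) ** 2 for xi in xs) / (n - 1)
        best = min(best, (2 / (m + 1)) / v)
    return best

s = [3, 1, 1, 2, 2, 1]                        # partitionable: 3+1+1 = 2+2+1
print(min_ratio_bruteforce(s))                 # equals 2/(sum(s_i^2) + 2)
print(2 / (sum(si ** 2 for si in s) + 2))      # = 2/22
```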
Comment 1. In this proof, we consider the case when the values $x_i$ can be negative and larger than 1, while in bioinformatics, $x_i$ is always between 0 and 1. However, we can easily modify this proof. First, we can shift all the values $x_i$ by the same constant to make them positive; a shift changes neither C[x,y] nor V[x]. Second, to make the positive values ≤ 1, we can then re-scale the values $x_i$ ($x_i \to \lambda \cdot x_i$), thus multiplying C[x,y]/V[x] by a known constant. As a result, we get the new values

$x_i' = \dfrac{1}{2}\left(\dfrac{x_i}{K} + 1\right),\ \text{where}\ K \overset{def}{=} \max_i s_i,$

for which $x_i' \in [0,1]$, and the problem of computing C[x,y]/V[x] is still NP-hard.
Comment 2. Since we cannot compute the exact range, what can we do to compute more accurate estimates for the range than those provided by the linear approximation? One possibility is to use known algorithms to find the ranges for C[x,y] and for V[x] (Kreinovich, Xiang, Starks, Longpre, Ceberio, Araiza, Beck, Kandathi, Nayak, Torres, & Hajagos, 2006; Kreinovich et al., 2007), and then use the division operation from interval arithmetic to get an interval that is guaranteed to contain C[x,y]/V[x].
ACKNOWLEDGMENT
This work was supported in part by NSF grants
HRD-0734825 and EAR-0225670, by Texas
Department of Transportation grant No. 0-5453,
and by the Japan Advanced Institute of Science
and Technology (JAIST) International Joint Re-
search Grant 2006-08. The authors are thankful
to the anonymous referees for their valuable
suggestions.
REFERENCES
Dantsin, E., Kreinovich, V., Wolpert, A., & Xiang,
G. (2006). Population variance under interval un-
certainty: A new algorithm. Reliable Computing,
12(4), 273-280.
Dubois, D., Fargier, H., & Fortin, J. (2005). The
empirical variance of a set of fuzzy intervals.
Proceedings of the 2005 IEEE International
Conference on Fuzzy Systems FUZZ-IEEE2005,
Reno, Nevada, May 22-25, pp. 885-890.
Fell, J. M. G. (1961). The structure of algebras of operator fields. Acta Mathematica, 101, 19-38.
Fell, J. M. G. (1962). A Hausdorff topology for the
closed sets of a locally compact non-Hausdorff
space. Proceedings of the American Mathematical
Society, 13, 472-476.
Ferson, S., Ginzburg, L., Kreinovich, V., Longpre,
L., & Aviles, M. (2005). Exact bounds on finite
populations of interval data. Reliable Computing,
11(3), 207-233.
Ferson, S., & Hajagos, J. (2007). Interval versions
of statistical techniques, with applications to en-
vironmental analysis, bioinformatics, and privacy
in statistical databases. Journal of Computational
and Applied Mathematics, 199(2), 418-423.
Gierz, G., Hofmann, K. H., Keimel, K., Lawson, J.
D., Mislove, M., & Scott, D. S. (1980). A compen-
dium of continuous lattices. Berlin, Heidelberg,
New York: Springer-Verlag.
Goutsias, J., Mahler, R. P. S., & Nguyen, H. T.
(Eds.) (1997). Random sets: Theory and applica-
tions. New York: Springer-Verlag.
Heitjan, D. F., & Rubin, D. B. (1991). Ignorability and coarse data. Annals of Statistics, 19(4), 2244-2253.
Jaulin L., Kieffer, M., Didrit, O., & Walter,
E. (2001). Applied interval analysis. London:
Springer-Verlag.
Klir, G., & Yuan, B. (1995). Fuzzy sets and fuzzy
logic: Theory and applications. Upper Saddle
River, NJ: Prentice Hall.
Kreinovich V., Lakeyev, A., Rohn, J., & Kahl, P.
(1997). Computational complexity and feasibility
of data processing and interval computations.
Dordrecht: Kluwer.
Kreinovich, V., Longpre, L., Starks, S. A., Xiang,
G., Beck, J., Kandathi, R., Nayak, A., Ferson, S.,
& Hajagos, J. (2007). Interval versions of statistical
techniques, with applications to environmental
analysis, bioinformatics, and privacy in statistical
databases. Journal of Computational and Applied
Mathematics, 199(2), 418-423.
Kreinovich, V., Xiang, G., Starks, S. A., Longpre,
L., Ceberio, M., Araiza, R., Beck, J., Kandathi,
R., Nayak, A., Torres, R., & Hajagos, J. (2006).
Towards combining probabilistic and interval
uncertainty in engineering calculations: Algo-
rithms for computing statistics under interval
uncertainty, and their computational complexity.
Reliable Computing, 12(6), 471-501.
Li, S., Ogura, Y., & Kreinovich, V. (2002). Limit
theorems and applications of set valued and fuzzy
valued random variables. Dordrecht: Kluwer.
Matheron, G. (1975). Random sets and integral
geometry. J. Wiley.
Möller, B., & Beer, M. (2004). Fuzzy randomness: Uncertainty in civil engineering and computational mechanics. Berlin: Springer-Verlag.
Nguyen, H. T. (2006). An introduction to random
sets. Chapman and Hall/CRC Press.
Nguyen, H. T., & Kreinovich, V. (1996). Nested
intervals and sets: Concepts, relations to fuzzy
sets, and applications. In R. B. Kearfott & V. Kreinovich (Eds.), Applications of interval computations (pp. 245-290). Dordrecht: Kluwer.
Nguyen, H. T. & Tran, H. (2007). On a continu-
ous lattice approach to modeling of coarse data
in system analysis. Journal of Uncertain Systems,
1(1), 62-73.
Nguyen, H. T., & Walker, E. A. (2006). A frst
course in fuzzy logic (3rd edition). Chapman and
Hall/CRC Press.
Nguyen, H. T., Wang, Y., & Wei, G. (in press).
On Choquet theorem for random upper semi-
continuous functions. International Journal of
Approximate Reasoning.
Rabinovich, S. (2005). Measurement errors and
uncertainties: Theory and practice. New York:
Springer-Verlag.
Rockafellar, R. T., & Wets, R. J.-B. (1984). Variational systems, an introduction. In Springer lecture notes in mathematics, Vol. 1091 (pp. 1-54).
Wei, G., & Wang, Y. (in press). On metrization
of the hit-or-miss topology using Alexandroff
compactifcation. International Journal of Ap-
proximate Reasoning.
Xiang, G. (2007). Interval uncertainty in bioin-
formatics. Abstracts of the New Mexico Bioin-
formatics Symposium, Santa Fe, New Mexico, March 8-9, p. 25.
Chapter III
Pattern Discovery in Gene
Expression Data
Gráinne Kerr
Dublin City University, Ireland
Heather Ruskin
Dublin City University, Ireland
Martin Crane
Dublin City University, Ireland
ABSTRACT
Microarray technology provides an opportunity to monitor mRNA levels of expression of thousands of genes simultaneously in a single experiment. The enormous amount of data produced by this high-throughput approach presents a challenge for data analysis: to extract meaningful patterns, to evaluate its quality, and to interpret the results. The most commonly used method of identifying such patterns is cluster analysis. Approaches that are common and sufficient for many data-mining problems, for example, hierarchical and K-means clustering, do not address well the properties of typical gene expression data and fail, in significant ways, to account for its profile. This chapter clarifies some of the issues and provides a framework to evaluate clustering in gene expression analysis. Methods are categorised explicitly in the context of application to data of this type, providing a basis for reverse engineering of gene regulation networks. Finally, areas for possible future development are highlighted.
INTRODUCTION
A fundamental factor of function in a living cell is
the abundance of proteins present at a molecular
level, that is, its proteome. The variation between
proteomes of different cells is often used to ex-
plain differences in phenotype and cell function.
Crucially, gene expression is the set of reactions
that controls the level of messenger RNA (mRNA)
in the transcriptome, which in turn maintains
the proteome of a given cell. The transcriptome
is never synthesized de novo; instead, it is main-
tained by gene expression replacing mRNAs that
have been degraded, with changes in composi-
tion brought about by switching different sets of
genes on and off. To understand the mechanisms
of cells, involved in a given biological process,
it is necessary to measure and compare gene
expression levels in different biological phases,
body tissues, clinical conditions, and organisms.
Information on the set of genes expressed, in
a particular biological process, can be used to
characterise unknown gene function, identify
targets for drug treatments, determine effects
of treatment on cell function, and understand
molecular mechanisms involved.
DNA microarray technology has advanced
rapidly over the past decade, although the concept
itself is not new (Friemert, Erfe, & Strauss, 1989;
Gress, Hoheisel, Sehetner, & Leahrach 1992). It
is now possible to measure the expression of an
entire genome simultaneously, (equivalent to the
collection and examination of data from thou-
sands of single gene experiments). Components
of the system technology can be divided into:
(1) Sample preparation, (2) Array generation
and sample analysis, and (3) Data handling and
interpretation. The focus of this chapter is on the
third of these.
Microarray technology utilises the base-pairing hybridisation properties of nucleic acids, whereby each of the four base nucleotides (A, T, G, C) will bind with only one of the four base ribonucleotides (A, U, G, C; pairing: A-U, T-A, C-G, G-C). Thus, a unique sequence
of DNA that characterises a gene will bind to
a unique mRNA sequence. Synthesized DNA
molecules, complementary to known mRNA, are
attached to a solid surface, referred to as probes.
These are used to measure the quantity of specifc
mRNA of interest that is present in a sample (the
target). The molecules in the target are labelled,
and a specialised scanner is used to measure the
amount of hybridisation (intensity) of the target
at each probe. Gene intensity values are recorded
for a number of microarray experiments typically
carried out for targets derived under various
experimental conditions (Figure 1). Secondary
variables (covariates) that affect the relationship
between the dependent variable (experimental
condition) and independent variables of primary
interest (gene expression) include, for example,
age, disease, and geography among others, and
can also be measured.
Figure 1. mRNA is extracted from a transcriptome of interest, (derived from cells grown under precise
experimental conditions). Each mRNA sample is hybridised to a reference microarray. The gene intensity
values for each experiment are then recorded.
An initial cluster analysis step is applied to
gene expression data to search for meaningful
informative patterns and dependencies among
genes. These provide a basis for hypothesis testing: the basic assumption is that genes, showing
similar patterns of expression across experimental
conditions, may be involved in the same underly-
ing cellular mechanism. For example, Alizadeh,
Eisen, Davis, Ma, Lossos, Rosenwald, Boldrick,
Sabet, Tran, Yu, Powell, Yang, Marti, Moore,
Hudson Jr, Lu, Lewis, Tibshirani, Sherlock, Chan,
Greiner, Weisenburger, Armitage, Warnke, Levy,
Wilson, Grever, Byrd, Botstein, Brown, and Staudt
(2000) used a hierarchical clustering technique, applied to gene expression data derived from diffuse large B-cell lymphomas (DLBCL), to identify two molecularly distinct subtypes. These had gene expression patterns indicative of different stages of B-cell differentiation: germinal centre B-like DLBCL and activated B-like DLBCL. Findings suggested that patients with germinal centre B-like DLBCL had a significantly better overall survival rate than those with activated B-like DLBCL. This work indicated a significant methodology shift towards characterisation of cancers based on gene expression rather than morphological, clinical and molecular variables.
BACKGROUND
The Gene Expression Dataset
Data are typically presented as a real-valued ma-
trix, with rows representing the expression of a
gene over a number of experiments, and columns
representing the pattern of expression of all genes
for a given microarray experiment. Each entry $x_{ij}$ is the measured expression of a gene i in experiment j (Figure 1). The following terms and notations
are used throughout this chapter:
A gene/gene profile x is a single data item (feature vector) used by the clustering algorithm. It consists of d measurements, $x = (x_1, x_2, \ldots, x_d)$.

A condition y is a single microarray experiment corresponding to a single column in the gene expression matrix, $y = (x_1, x_2, \ldots, x_n)^T$, where n is the number of genes in the dataset.

The individual scalar components $x_{ij}$ of each gene vector represent the measured expression of gene i under experimental condition j.
Database | Description | URL
ArrayExpress | Gene expression and hybridisation array data repository | http://www.ebi.ac.uk/arrayexpress/#ae-main[0]
CellMiner | Data from 60 cancer cell lines based on Affymetrix and cDNA microarray data | http://discover.nci.nih.gov/cellminer
ExpressDB | Collection of E. coli and yeast RNA expression datasets | http://arep.med.harvard.edu/ExpressDB/
GEO | Gene expression and hybridisation array data repository | http://www.ncbi.nlm.gov/geo/
RAD | Gene expression and hybridisation array data repository | http://www.cbil.upenn.edu/RAD/
SMD | Extensive collection of microarray data | http://genome-www.stanford.edu/microarray

Table 1. Selection of publicly available dataset repositories
There are a number of publicly available dataset repositories, which contain a wealth of microarray datasets; Table 1 provides a sample of these.
Typically, these repositories store data using the
minimum information about microarray experi-
ment (MIAME) standard (Brazma, Hingamp,
Quackenbush, Sherlock, Spellman, Stoeckert,
Aach, Ansorge, Ball, Causton, Gaasterland,
Glenisson, Holstege, Kim, Markowitz, Matese,
Parkinson, Robinson, Sarkans, Schulze-Kremer,
Stewart, Taylor, Vilo, & Vingron, 2001), which
allow researchers to replicate the experiments.
This allows analysts to compare gene expression
data from different laboratories effectively, based
on information about the microarrays used in
experiments, how these were produced, samples
obtained and mRNA extracted and labelled. Ad-
ditional information is also recorded on methods
used to hybridise the sample, scan the image and
normalise the data.
Characteristics of the Gene
Expression Dataset
Choice of the appropriate clustering technique
relies on the amount of information on the par-
ticular properties of gene expression data available
to the analyst, and hence the likely underlying
structure. The following data characteristics are
typical of the gene expression dataset:
Measurement accuracy of mRNA expres-
sion levels depends on the experimental design
and rigour. While design of experiments is not
a specifc focus of this chapter, a good design
minimises variation and has a focused objective
(Kerr & Churchill, 2001). Technical variation
between microarray slides depends on numerous
factors including experimental technique, instru-
ment accuracy for detecting signals, and observer
bias. Biological variation may arise due to dif-
ferences in the internal states of a population of
cells, either from predictable processes, such as
cell cycle progression, or from random processes
such as partitioning of mitochondria during cell
division, variation due to subtle environmental
differences, or ongoing genetic mutation (Raser &
O'Shea, 2005). Pre-processing techniques attempt
to remove technical variation while maintaining
interesting biological variation.
Many variables, both random and fxed,
(biological and technical), are associated with
microarray measurements. Data is thus intrinsi-
cally noisy and outliers in the dataset need to be
identified and managed effectively. This usually
takes one of two forms, (i) outlier accommodation;
uses a variety of statistical estimation or testing
procedures, which are robust against outliers, (ii)
identification and decision on inclusion/exclusion,
used when outliers may contain key information
(Liu, Cheng, & Wu, 2002). Normalisation proce-
dures applied to gene expression data (Bolstad,
Irizarry, Astrand, & Speed, 2003), aim at minimis-
ing the effect of outliers (assuming these to be due
to experimental variation and thus undesirable).
Most manufacturers of microarrays, aware of
effects of optical noise and non-specifc binding,
include features in their arrays to measure these
directly: these measurements can be used in the
normalisation procedures. Note: although pre-
processing methods attempt to remove all noise
these may be only partially successful.
Missing values are common to microarray
data, and can be caused by insufficient resolu-
tion in image analysis, image corruption, dust
or scratches on the slide, robotic method used to
create the slide, and so on, (Troyanskaya, Cantor,
Sherlock, Brown, Hastie, Tibshirani, Botstein, &
Altman, 2001). In general, the number of missing
values increases with the number of genes being
measured. Many clustering algorithms, used for
gene expression data, require a complete matrix of
input values. Consequently, imputation or missing
data estimation techniques need to be considered
in advance of clustering. The effect of missing
data on pattern information can be minimised
through pre-processing.
Commonly, missing values in the gene ex-
pression matrix are replaced by zeroes or by an
average expression level of the gene, (or row
average). Such methods do not, however, take
into account the correlation structure of the data
and more sophisticated options include K-Nearest
Neighbour (KNN) and Support Vector Decom-
position type methods. Troyanskaya et al. (2001)
note that KNN and SVD-based methods are more
effective than traditional methods of replacement,
with KNN being more robust as the number of
missing values increases.
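To illustrate the idea (this sketch is ours, not the method of Troyanskaya et al.): for each gene row with a missing entry, find the K most similar complete rows over the observed columns and average their values in the missing column:

```python
import numpy as np

def knn_impute(M, k=3):
    """Fill NaNs in a genes x conditions matrix: for each incomplete row,
    average the values of the k nearest complete rows (Euclidean distance
    over the columns observed in that row)."""
    M = M.copy()
    complete = M[~np.isnan(M).any(axis=1)]
    for row in M:
        miss = np.isnan(row)
        if not miss.any():
            continue
        d = np.linalg.norm(complete[:, ~miss] - row[~miss], axis=1)
        neighbours = complete[np.argsort(d)[:k]]
        row[miss] = neighbours[:, miss].mean(axis=0)
    return M

M = np.array([[1.0, 2.0, 3.0],
              [1.1, np.nan, 3.2],
              [0.9, 2.1, 2.8],
              [5.0, 5.5, 6.0]])
print(knn_impute(M, k=2))   # the NaN is replaced by the neighbours' average
```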
Clustering algorithms that permit overlap
(probabilistic or fuzzy clusters) are typically
more applicable to gene expression data since: (i)
the impact of noisy data on clusters obtained is a
fundamental consideration in algorithm choice.
(The assumption is that noisy genes are unlikely
to belong to any one cluster, but are equally likely
to be members of several clusters); (ii) the underlying principle of clustering gene expression data is that genes with similar change in expression for a
set of conditions are involved, together, in a simi-
lar biological function. Typically, gene products
(mRNA) are involved in several such biological
functions and groups need not be co-active under
all conditions. This gives rise to high variability
in the gene groups and/or some overlap between
them. For these reasons, constraining a gene to a
single cluster (hard clustering) is counter-intuitive
with respect to natural behaviour.
Additionally, methods that aim at a partial clustering
tend to be more suited to expression data, with some
genes or conditions not members of any cluster
(Madeira & Oliveira, 2004). Clustering the microarray
dataset can be viewed in two ways: (i) genes can form
a group which shows similar expression across
conditions; (ii) conditions can form a group which
shows similar expression across all genes. It is this
interplay of conditions and genes that gives rise to
bi-clusters, whereby conditions and genes are
simultaneously grouped. Such partial clusterings (or
bi-clusters) are defined over a subset of conditions
and a subset of genes, thus capturing local structure
in the dataset. Clearly, this allows: (i) noisy genes
to be left out, with correspondingly less impact on
the final outcome; (ii) genes belonging to no cluster,
omitting a large number of irrelevant contributions;
(iii) genes not belonging to well-defined groups.
(Microarrays measure expression for the entire genome
in one experiment, but genes may change expression
independently of the experimental condition [e.g.,
due to stage in cell cycle]. Forced inclusion of such
genes in well-defined but inappropriate groups may
impact the final structures found for the data.)
Methods of Identifying Groups of
Related Genes
Cluster definition depends on clearly defined
metrics, which must be chosen to reflect the nature
of the data. Metric categories include:
Similarity-Based
The cluster is defined to be the set of objects in
which each object is closer (or more similar) to the
prototype that defines that cluster than to any other
cluster prototype. A typical gene expression cluster
prototype is the average, or centroid, of all gene
vectors in the cluster. The similarity metric used
affects the clusters produced. Common measures
include: (i) Euclidean distance, (ii) Manhattan
distance, and (iii) squared Pearson correlation
distance (Quackenbush, 2001), with the last being
the most popular, as it captures the shape of a gene
expression profile without regard to the magnitude
of the measurements. Although this measure is quite
sensitive to outliers, correlation, rather than
distance, is inherently more important for gene
expression data. Take, for example, two gene vectors
$X_1 = (1, 2, 3, 4, 5)$ and $X_2 = (3, 6, 9, 12, 15)$.
These two profiles give a Euclidean distance of
14.8323 and a Manhattan distance of 30. The Pearson
correlation distance, however, is 0, reflecting the
fact that the two genes show the same pattern of
expression.
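The three distances quoted for this example are easy to verify; the sketch below assumes the squared-correlation form 1 − r² of the Pearson correlation distance (the plain 1 − r form gives the same answer here, since r = 1).

```python
import numpy as np

x1 = np.array([1, 2, 3, 4, 5], dtype=float)
x2 = np.array([3, 6, 9, 12, 15], dtype=float)

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))   # sqrt(220) ~ 14.8324
manhattan = np.sum(np.abs(x1 - x2))           # 30.0
r = np.corrcoef(x1, x2)[0, 1]                 # Pearson r = 1.0 (same shape)
pearson_dist = 1.0 - r ** 2                   # squared-correlation distance = 0.0
```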
Density-Based
Clusters, in this instance, are based on dense
regions of genes surrounded by less-dense regions.
Such methods are often employed when clusters are
irregular or intertwined, and when noise and outliers
are present (Sander, Ester, Kriegel, & Xu, 1998).
However, as each cluster is assumed to have a uniform
density, the method is not readily applicable to gene
expression data, where some biological functions
involve more gene products than others. The high
dimensionality also means that density thresholds can
be difficult to define and expensive to compute.
Model-Based
Despite the convenience of similarity-based measures,
it can be biologically meaningless to characterise a
cluster through a cluster prototype, such as the mean
or centroid, as these may be poorly representative of
the cluster elements as a whole. As a typical gene
expression dataset is large, noisy distortion of these
prototypes may be considerable, resulting in
relatively uninformative structures. In contrast,
model-based techniques, applied to expression space,
consider the fit of genes in a given cluster to an
ideal cluster. Concentrating on the strengths of the
bi-clustering approach, and following the notation of
Madeira and Oliveira (2004), four types of model can
be identified:
(i) Bi-clusters with constant values. A perfect
cluster is a sub-matrix (I, J) of the gene expression
matrix (N, D) with all values equal, $x_{ij} = \mu$.
The ideal bi-cluster is, of course, rarely found in
noisy gene expression data.
(ii) Bi-clusters with constant values on rows or
columns. A relaxation of the ideal or constant
bi-cluster model, and one which is more realistic for
gene expression data, is a sub-matrix with constant
rows or columns. For the former, rows have constant
value in a sub-matrix (I, J) given by
$a_{ij} = \mu + \alpha_i$ or $a_{ij} = \mu \cdot \alpha_i$,
where $\mu$ is the typical bi-cluster value and
$\alpha_i$ is the row offset for $i \in I$. Similarly,
perfect bi-clusters with constant columns are obtained
from $a_{ij} = \mu + \beta_j$ or
$a_{ij} = \mu \cdot \beta_j$, where $\beta_j$ is the
column offset for $j \in J$.
(iii) Bi-clusters with coherent values. From (ii), a
combined additive model can be derived. In this
framework, a bi-cluster is a sub-matrix (I, J) with
coherent³ values, based on the model:

$$a_{ij} = \mu + \alpha_i + \beta_j \qquad \text{Eq. 1}$$

(where $\mu$, $\alpha_i$ and $\beta_j$ are as for (ii)).
Similarly, the multiplicative model assumes that a
perfect bi-cluster can be identified using
$a_{ij} = m' \cdot a'_i \cdot b'_j$. Note that the
additive form follows directly for $\mu = \log(m')$,
$\alpha_i = \log(a'_i)$ and $\beta_j = \log(b'_j)$.
The artificial example in Figure 2(a) and (b)
illustrates this point. The sub-matrix is an ideal
bi-cluster found in a fictional dataset, where
$\mu = 1$, the offsets for rows 1 to 3 are
$\alpha_1 = 0$, $\alpha_2 = 2$, $\alpha_3 = 4$
respectively, and the offsets for columns 1 to 6 are
$\beta_1 = 0$, $\beta_2 = 1$, $\beta_3 = 2$,
$\beta_4 = 4$, $\beta_5 = 1$, $\beta_6 = -1$
respectively. The expression levels can be obtained
from Eq. 1. Of course, when searching the dataset for
a fit to this model, the mean and offset parameters
are unknown and must be estimated from the data. The
schematic illustrates the coherent expression profile
over the six conditions. Similarly for the
multiplicative model, where $\mu = 1$,
$\alpha_1 = 1$, $\alpha_2 = 2$, $\alpha_3 = 5$,
$\beta_1 = 1$, $\beta_2 = 2$, $\beta_3 = 4$,
$\beta_4 = 6$, $\beta_5 = 3$, $\beta_6 = 1.5$.
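As a sketch, the two perfect bi-clusters of Figure 2(a) and (b) can be generated directly from these offsets; NumPy broadcasting stands in for the explicit double loop over rows and columns.

```python
import numpy as np

mu = 1.0
alpha = np.array([0.0, 2.0, 4.0])                  # row offsets, genes 1-3
beta = np.array([0.0, 1.0, 2.0, 4.0, 1.0, -1.0])   # column offsets, conditions 1-6

# Additive model (Eq. 1): a_ij = mu + alpha_i + beta_j
A = mu + alpha[:, None] + beta[None, :]

# Multiplicative model: a_ij = mu * alpha_i * beta_j, with the offsets
# quoted for the multiplicative example
alpha_m = np.array([1.0, 2.0, 5.0])
beta_m = np.array([1.0, 2.0, 4.0, 6.0, 3.0, 1.5])
M = mu * alpha_m[:, None] * beta_m[None, :]
```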
In reality, these perfect bi-clusters are, of course,
unlikely to occur, so each entry in the sub-matrix can
be regarded as having a residue component (Cheng &
Church, 2000):

$$a_{ij} = \mu + \alpha_i + \beta_j + r_{ij} \qquad \text{Eq. 2}$$

where $r_{ij}$ is the residue of entry $(i, j)$.
Thus, finding bi-clusters is equivalent to finding
sub-matrices that minimise the average residue.
(iv) Bi-clusters with coherent evolution. Local
structures, with coherent evolution across a
sub-matrix (I, J), can exist in the data regardless
of the exact values. This occurs if there is a pattern
of co-regulation for a subset of genes and conditions.
Expression can occur at different levels so, for
example, if two genes are up-regulated by different
degrees (e.g., due to a specific condition), they are
said to experience coherent evolution.
Take Figure 2(c) as an example: gene 1 and gene 2 are
regulated with similar periodicity, while gene 3 shows
an alternating periodicity. Although the genes are
expressed at different levels, each change in
expression level is triggered by the same condition.
In a simple form, each gene can be said to exhibit
three states: down-regulated, up-regulated or no
change. Additional states can be used, for example
strongly up-regulated, weakly up-regulated, etc.,
depending on the detail of the model required. Adding
states, of course, adds complexity to the model, and
cut-off points between states of regulation must be
considered carefully. The problem then reduces to
finding profiles that show consistent patterns of
regulation across all conditions.
CLUSTER ANALYSIS
Current Methods
With an extensive choice of metric, structure,
completeness, etc., in cluster analysis, it is useful
to consider a framework (Table 2) for performance
comparison. The taxonomy used here is due to Jain,
Murty, and Flynn (1999).
Hierarchical Methods
Ever since the landmark paper of Eisen et al.
(1998), numerous clustering algorithms have been
applied to gene expression data. Predominantly
these have been hierarchical methods (Higgins,
Shinghal, Gill, Reese, Terris, Cohen, Fero, Pollack,
van de Rijn, & Brooks, 2003; Khodursky, Peter,
Cozzarelli, Botstein, Brown, & Yanofsky, 2000;
Makretsov, Huntsman, Nielsen, Yorida, Peacock,
Cheang, Dunn, Hayes, van de Rijn, Bajdik, &
Gilks, 2004; Wen, Fuhrman, Michaels, Carr,
Smith, Barker, & Somogyi, 1998), due mainly to
ease of implementation, visualisation capability
and general availability.
The basic steps of a hierarchical clustering
algorithm include: (i) computation of the proximity
matrix of distances between each pair of genes (initially
Figure 2. Models in gene expression datasets. The matrix gives clusters found, where rows are gene
expression values across six experimental conditions (columns). The x-axis indicates experimental
condition or time point; the y-axis indicates gene expression level. Model forms are (a) additive for
rows and columns, (b) multiplicative for rows and columns, and (c) coherent evolution.
each is in a unique cluster of size one), (ii) searching
the proximity matrix for the two closest clusters,
(iii) merging these two clusters and updating the
proximity matrix, and (iv) repeating steps two and
three until all genes are in one cluster.
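These four steps map directly onto standard library routines. The sketch below uses SciPy's agglomerative clustering on randomly generated data; the correlation metric, average linkage and cut level are illustrative choices, not prescriptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 6))  # 50 genes x 6 conditions

# (i) proximity matrix (condensed form) of gene-to-gene distances
D = pdist(X, metric="correlation")                 # 1 - Pearson correlation

# (ii)-(iv) repeated merging of the two closest clusters
Z = linkage(D, method="average")                   # one of the linkage choices

# Cut the resulting dendrogram to recover, say, five flat clusters
labels = fcluster(Z, t=5, criterion="maxclust")
```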
Such agglomerative clustering techniques vary with
respect to the distance metric used and the decision
on cluster merger (that is, the linkage choice:
single, complete, average or centroid; see
Quackenbush [2001]). Typically, the output of a
hierarchical clustering algorithm is a dendrogram,
representing nested patterns in the data and the
similarity level at which clusters are merged. The
choice of parameters affects both the structure of,
and the relationship between, the clusters.
Hierarchical cluster structure works well for
situations where membership is crisp but, despite
their popularity, these methods may not be appropriate
to capture natural structures in gene expression data.
Nevertheless, some successes in clustering conditions
based on gene expression have been reported. For
example, Makretsov et al. (2004) used gene expression
profiles to determine whether sub-types of invasive
breast cancer could be identified, with a view to
improving patient prognosis. Hierarchical clustering
successfully identified three cluster groups with
significant differences in clinical outcome.
Similarly, a study on renal cell carcinoma (Higgins
et al., 2003) found that hierarchical clustering led
to segregation of histologically distinct tumour
types solely based on their gene expression patterns
(p. 925). These studies indicate that characterisation
of tumours is potentially viable from gene expression
profiling.
Hierarchical clustering algorithms, characteristically,
locate complete clusters, force gene membership and
carry a large time-space complexity (Table 2).
Common clustering techniques:

Technique (reference) | Gene Membership | Cluster Structure | Cluster Type | Complete/Partial
Hierarchical (Eisen, Spellman, Brown, & Botstein, 1998) | Hard | Hierarchical (nested) | Similarity-based | Complete
K-Means (Tavazoie, Hughes, Campbell, Cho, & Church, 1999) | Hard | No structure | Similarity-based | Complete
FCM (Gasch & Eisen, 2002) | Fuzzy | No structure | Similarity-based | Complete
SOM (Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Loh, Downing, Caligiuri, Bloomfield, & Lander, 1999) | Hard | Topological structure | Similarity and neighbourhood kernel function-based | Complete
Delta clusters (Cheng & Church, 2000) | Shared | Overlap | Based on coherent additive model | Partial
FLOC (Yang, Wang, Wang, & Yu, 2003) | Shared | Overlap | Based on coherent additive model | Partial
SAMBA (Tanay, Sharan, & Shamir, 2002) | Shared | Overlap | Based on coherent evolution model | Partial

Table 2. Popular clustering techniques applied to gene expression data. Partial (overlapping) clusters
are more relevant in this context.
However, inclusion of noisy genes in a cluster can
affect the final grouping (depending, to a greater or
lesser extent, on the linkage method and the distance
measure used). As the algorithms are prototype-based,
further iterations exacerbate noise effects. Given
their distance-metric basis, hierarchical techniques
also tend to produce globular structures.
Partitive Methods
In contrast to hierarchical algorithms, which create
clusters in a bottom-up fashion resulting in nested
levels of clustering, partitive methods optimise a
function of given criteria, partitioning the entire
dataset and obtaining one cluster structure.
Partitive K-Means clustering (MacQueen, 1967)
produces a hard clustering with no structural
relationship between the individual clusters. The
main steps of the K-Means algorithm are: (i)
identification of K prototype vectors for K clusters
in the dataset; (ii) assignment of each gene to a
cluster based on its similarity to the cluster
prototype; (iii) computation of cluster prototypes
based on the current genes in the cluster; (iv)
repetition of steps two and three until convergence
criteria are satisfied. These may be, for example,
no (or minimal) reassignment of genes to new clusters,
or minimal improvement in the optimisation criterion.
A typical optimisation approach is to minimise the
squared error within a cluster:
error within a cluster:
) , (
1 1
j i
n
i
ij
k
j
q x d y C
= =
= Eq. 3
where q
j
is the vector representing the mean of
the cluster, x
i
is the vector representing the gene,
d(x
i
,q
j
) is a distance measure and y
ij
is a partition
element. Here y
ij
{0,1}, and y
ij
=1, indicates that
gene i is assigned to cluster j.
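A minimal sketch of the procedure, using scikit-learn's KMeans (which optimises Eq. 3 with squared Euclidean distance as d, reported as the model's inertia); the data and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(100, 15))  # 100 genes x 15 time points

# n_init restarts address the sensitivity to initial prototype vectors;
# the best configuration under the criterion function is kept.
km = KMeans(n_clusters=8, n_init=20, random_state=0).fit(X)

labels = km.labels_        # hard assignments (the y_ij of Eq. 3)
within_ss = km.inertia_    # value of the minimised criterion
```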
An example of the use of the K-Means method is
discussed in Tavazoie et al. (1999), based on a yeast
time-course gene expression dataset containing
profiles for more than 6000 genes, with 15 time points
(at 10-minute intervals, over nearly two cell cycles)
(Cho, Campbell, Winzeler, Steinmetz, Conway, Wodicka,
Wolfsberg, Gabrielian, Landsman, Lockhart, & Davis,
1998). (This work succeeded in identifying
transcriptionally co-regulated genes in yeast.)
Unfortunately, the initial prototype vectors in
K-Means usually have a large impact on the data
structures found. Prototype vectors are often genes
selected at random from the dataset. Alternatively,
Principal Component Analysis can be used to project
the data to a lower-dimensional sub-space, and K-Means
is then applied to the sub-space (Zha, Ding, Gu, He,
& Simon, 2002). Whichever method is used in practice
to select prototype vectors, it is usually the case
that different initial prototypes are investigated to
assess the stability of the results, with the best
configuration (according to the optimisation criteria)
used as the output clusters.
Fuzzy Methods
As observed in the section Characteristics of the
Gene Expression Dataset, multiple cluster membership
is more appropriate for gene expression data. The
Fuzzy C-Means (FCM) algorithm extends the standard
K-Means algorithm to the case where each gene has a
membership degree indicating its fuzzy (or percentage)
association with the centroid of a given cluster.
Typically, each gene has a total membership value of
1, which is divided proportionally between clusters
according to its similarity with the cluster means.
A fuzzy partition matrix Y (of dimension N×K, where
K is the number of clusters and N is the number of
genes) is created, where each element $y_{ij}$ is the
membership grade of gene $i$ in cluster $j$, and a
weighted version of Eq. 3 applies. At each iteration,
the membership value $y_{ij}$ and the cluster centre
$k_j$ are updated by:

$$y_{ij} = \left[ \sum_{c=1}^{K} \left( \frac{d(x_i, k_j)}{d(x_i, k_c)} \right)^{2/(m-1)} \right]^{-1} \qquad \text{Eq. 4}$$
$$k_j = \frac{\sum_{i=1}^{N} y_{ij}^m\, x_i}{\sum_{i=1}^{N} y_{ij}^m} \qquad \text{Eq. 5}$$

where $m > 1$ denotes the degree of fuzziness
(everything else is as for Eq. 3). The iterations stop
when $\max_{ij} \left| y_{ij}^{(k+1)} - y_{ij}^{(k)} \right| < \varepsilon$,
where $\varepsilon$ is a termination criterion with a
value between 0 and 1, and $k$ is the iteration number.
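One FCM iteration can be written compactly; the NumPy sketch below implements Eq. 4 and Eq. 5 with Euclidean distance as d. The function name is ours, and the small eps simply guards against division by zero when a gene coincides with a centre.

```python
import numpy as np

def fcm_step(X, centres, m=2.0, eps=1e-9):
    """One FCM iteration: membership update (Eq. 4), then centre update (Eq. 5)."""
    # d[i, j] = Euclidean distance from gene i to cluster centre j
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + eps
    # Eq. 4: y_ij = 1 / sum_c (d(x_i,k_j) / d(x_i,k_c))^(2/(m-1))
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    Y = 1.0 / ratio.sum(axis=2)
    # Eq. 5: k_j = sum_i y_ij^m x_i / sum_i y_ij^m
    w = Y ** m
    centres = (w.T @ X) / w.sum(axis=0)[:, None]
    return Y, centres
```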
Given the usual constraint that the membership values
of a gene must sum to unity, these values should be
interpreted with care. A large membership value does
not indicate strength of expression, but rather
reduced co-membership across several clusters
(Krishnapuram & Keller, 1993). Table 3 illustrates
this idea for three clusters. FCM was carried out on
published yeast genomic expression data (Gasch &
Eisen, 2002; results available at
http://rana.lbl.gov/FuzzyK/data.html). The membership
values for gene B and gene D are very different for
cluster 21, although they are approximately
equidistant from the centroid of the cluster.
Similarly, gene C and gene D have comparable
membership values for cluster 4, although gene C is
more typical of the cluster than gene D. With similar
centroid distances, the membership value of gene B in
cluster 21 is smaller than the membership value of
gene A in cluster 46. These values arise from the
constraint that membership values must sum to unity
across all clusters, forcing a gene to give up some
of its membership in one cluster to increase it in
another. Listing the genes of a cluster based on
membership values alone is thus somewhat non-intuitive,
as membership is not a measure of compatibility with
the cluster; interpretation of the list in terms of
the degree of sharing between clusters is, however,
of value.
The work of Gasch and Eisen (2002) on the
use of FCM in analysing microarray data looked
at clustered responses of yeast genes to environ-
mental changes. Groups of known functionally co-
regulated genes, and novel groups of co-regulated
genes, were found by this method, although missed
by both hierarchical and K-means methods.
Artificial Neural Networks
Artificial neural networks (ANN) mimic the idea of
biological neural networks, where the links between
various neurons (nodes) can be strengthened or
weakened through learning. A number of ANN types have
been explored, with Self-Organising Maps (SOM)
(Kohonen, 1990) proving popular for the analysis of
gene expression, as these provide a fast method of
visualising and interpreting high-dimensional data.
The network maps the high-dimensional input gene
vectors into a lower-dimensional space. A SOM is
formed by an input layer of D nodes (where D is the
gene vector dimension) and an output layer of neurons
arranged in a regular grid (usually of one or two
dimensions). A vector of the same dimension as the
input gene references each node in the output layer.
Briefly, the mechanism involves: (i) initialisation
GID | Cluster 4 (Centroid Dist. / Mem.) | Cluster 21 (Centroid Dist. / Mem.) | Cluster 46 (Centroid Dist. / Mem.)
GENE1649X | 10.691 / 0.002575 | 8.476 / 0.002002 | 3.864 / 0.482479
GENE6076X | 6.723 / 0.009766 | 3.855 / 0.009341 | 6.33 / 0.007381
GENE5290X | 6.719 / 0.007653 | 5.29 / 0.00515 | 8.024 / 0.005724
GENE2382X | 7.725 / 0.007609 | 3.869 / 0.01782 | 6.279 / 0.010249

Table 3. Difficulties in interpreting membership values for FCM. GENE1649X (gene A), GENE6076X (gene B),
GENE5290X (gene C) and GENE2382X (gene D). The table gives the distance of each gene to the cluster
centroid, in terms of Euclidean distance, and the associated membership value.
of the prototype vectors of the output nodes; (ii)
training the network to find clusters in the data
(genes are selected at random from the dataset and
the closest output neuron is identified via its
prototype vector; once an output neuron is identified,
its topological neighbours are updated to reflect
this, and training continues until the reference
vectors satisfy a stopping criterion); and (iii)
sequential application of all gene vectors to the
SOM, where only one output neuron fires upon receiving
an input gene vector. The members of the cluster
represented by output neuron i are the set of genes
that, when applied to the input neurons, cause output
neuron i to fire.
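The NumPy sketch below illustrates steps (i) to (iii) for a one-dimensional SOM; the grid size, learning-rate schedule and Gaussian neighbourhood are simple illustrative choices rather than recommended settings.

```python
import numpy as np

def train_som(X, n_out=10, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal 1-D SOM: output neurons on a line, Gaussian neighbourhood."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_out, X.shape[1]))       # (i) initialise prototypes
    for t in range(epochs):
        lr = lr0 * (1.0 - t / epochs)              # decaying learning rate
        sigma = sigma0 * (1.0 - t / epochs) + 0.5  # shrinking neighbourhood
        for x in rng.permutation(X):               # (ii) train on random genes
            bmu = np.argmin(np.linalg.norm(W - x, axis=1))
            h = np.exp(-((np.arange(n_out) - bmu) ** 2) / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (x - W)         # update winner and neighbours
    return W

# (iii) map each gene to the output neuron it causes to "fire"
X = np.random.default_rng(1).normal(size=(60, 8))
W = train_som(X)
clusters = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
```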
ANN techniques have been used in a number of gene
expression studies, including Tamayo, Slonim, Mesirov,
Zhu, Kitareewan, Dmitrovsky, Lander, and Golub (1999)
(to analyse haematopoietic differentiation); Toronen,
Kolehmainen, Wong, and Castren (1999) (to analyse
yeast gene expression data); and Golub et al. (1999)
(to cluster acute lymphoblastic leukaemia [ALL] and
acute myeloid leukaemia [AML]). The stopping criterion
of the SOM is crucial, since over-fitting to the
training dataset is a risk. A further disadvantage is
the amount of prior information needed: SOM requires
input parameters such as the learning rate,
neighbourhood size and kernel function, as well as
the topology of the map (typically hexagonal or
square). The stability of results is also an issue,
as a particular gene vector can be found to cause
different output nodes to fire at different iterations
(Jain et al., 1999). Furthermore, the clusters
produced by the SOM are sensitive to the choice of
initial vectors for the output neurons, and a
sub-optimal structure may result from a poor selection.
Search-Based Methods
While the methods considered so far focus on finding
global structures in the data, local structures are
frequently of great interest. Cheng and Church (2000)
adapted the work of Hartigan (1972) for gene
expression data, producing simultaneous clusters of
genes and conditions and an overall partial clustering
of the data. From Eq. 1, each value $a_{ij}$ of a
sub-matrix can be defined in terms of the typical
value within the bi-cluster, $a_{IJ}$, plus the
offsets for the row mean, $a_{iJ} - a_{IJ}$, and the
column mean, $a_{Ij} - a_{IJ}$. Thus, each value in
the sub-matrix should (ideally) be:

$$a_{ij} = a_{iJ} + a_{Ij} - a_{IJ} \qquad \text{Eq. 6}$$

and the residue of an entry is the deviation from
this ideal value,
$r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$.
The Cheng and Church technique defines the Mean
Residue Score H of a sub-matrix, based on Eq. 2 and
Eq. 6, such that:

$$H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} r_{ij}^2 \qquad \text{Eq. 7}$$
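Eq. 6 and Eq. 7 translate directly into a few lines of NumPy; in this sketch, a perfect additive bi-cluster scores zero, as expected.

```python
import numpy as np

def mean_residue_score(A):
    """Mean Residue Score H(I, J) of Eq. 7 for a sub-matrix A (rows I, columns J)."""
    row_means = A.mean(axis=1, keepdims=True)   # a_iJ
    col_means = A.mean(axis=0, keepdims=True)   # a_Ij
    overall = A.mean()                          # a_IJ
    r = A - row_means - col_means + overall     # residues r_ij (from Eq. 6)
    return np.mean(r ** 2)

# A perfect additive bi-cluster (Eq. 1) has zero mean residue
A = 1.0 + np.arange(3)[:, None] + np.arange(4)[None, :]
assert np.isclose(mean_residue_score(A), 0.0)
```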
The algorithm carries out greedy iterative searches
for sub-matrices (I, J) which minimise this function
(Eq. 7), generating a large time cost, as each row
and column of the dataset must be tested for deletion.
(A sub-matrix is considered a bi-cluster if its Mean
Residue Score falls below a user-specified threshold.)
A further overhead is the masking of a bi-cluster with
random numbers once it is found, to prevent finding
the same clusters repeatedly on successive iterations.
There is, nevertheless, a high probability that this
replacement with random numbers affects the discovery
of further bi-clusters. To overcome this random
interference, Yang et al. (2003) developed flexible
overlapped bi-clustering (FLOC), generalising the
model of Cheng and Church to incorporate null values.
For both the Cheng and Church (2000) algorithm and
the FLOC generalisation, K can be specified to be
much larger than the desired number of groups without
affecting the outcome, as each row and column in the
expression matrix can belong to more than one
bi-cluster (Cheng & Church, 2000; Yang et al., 2003).
Selecting K then reduces to selecting the percentage
of the bi-clusters with the best Mean Residue Score
(Eq. 7). The cost is increased computation time: the
Cheng and Church algorithm finds one bi-cluster
at a time, while FLOC finds all the bi-clusters
simultaneously. However, a major additional strength
of the bi-clustering techniques is the minimal
requirement for domain knowledge. Also, as FLOC
accounts for null values in the dataset, preliminary
imputation of missing values is not necessary.
Graph Theoretic Methods
A further approach, which is proving useful in the
analysis of large, complex biological datasets, is
that of graph theory (Aiello, Chun, & Lu, 2000;
Aittokallio & Schwitowski, 2006; Guillaume & Latapy,
2006; Maslov, Sneppen, & Zaliznyak, 2004). A given
gene expression dataset can be viewed as a weighted
bipartite graph $G = (V, U, E)$, where $V$ is the set
of gene vertices, $U$ is the set of condition
vertices, with $V \cap U = \emptyset$, and $E$ is the
set of edges, with $(u, v) \in E$ having a weight
$a_{uv}$ proportional to the strength of the
expression of gene $v$ under condition $u$ (Figure 3).
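One way to materialise this representation is sketched below with the networkx library; the node labels and random weight matrix are purely illustrative.

```python
import numpy as np
import networkx as nx

X = np.abs(np.random.default_rng(2).normal(size=(7, 5)))  # 7 genes x 5 conditions

G = nx.Graph()
genes = [f"g{i}" for i in range(X.shape[0])]   # V: gene vertices
conds = [f"c{j}" for j in range(X.shape[1])]   # U: condition vertices
G.add_nodes_from(genes, bipartite=0)
G.add_nodes_from(conds, bipartite=1)
for i, g in enumerate(genes):
    for j, c in enumerate(conds):
        G.add_edge(g, c, weight=X[i, j])       # a_uv: expression strength
```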
Analysis of the models involved focuses on the
identification of similarly or densely connected
sub-graphs of nodes and, of course, relies greatly
on the method used to define the edge weights.
Clustering by graph network leads to similar issues
as before: (i) results are highly sensitive to data
quality and input parameters; (ii) predicted clusters
can vary from one method of graph clustering to
another. Clusters that share nodes and edges of a
graph network clearly overlap and, as noted in the
section titled Characteristics of the Gene Expression
Dataset, such overlaps are desirable for gene
expression interpretation.
The Statistical-Algorithmic Method for Bi-cluster
Analysis (SAMBA) (Tanay, Sharan, & Shamir, 2002) uses
a graphical approach and, unlike the previous
bi-clustering techniques, finds coherent evolution in
the data. An edge is defined to exist between a gene
node u and a condition node v if there is a
significant change in the expression of gene u under
condition v, relative to the gene's normal level (a
non-edge exists if u does not change expression under
v). Each edge and non-edge is then weighted, based on
a log-likelihood model, with weights

$$\log \frac{p_c}{P_{(u,v)}} > 0 \text{ for edges, and } \log \frac{1 - p_c}{1 - P_{(u,v)}} < 0 \text{ for non-edges.} \qquad \text{Eq. 8}$$
Here, $P_{(u,v)}$ is the fraction of random bipartite
graphs, with degree sequence identical to that of G,
that contain the edge $(u, v)$, and $p_c$ is a
constant probability assigned by the user. For
$p_c > \max_{(u,v) \in U \times V} P_{(u,v)}$, edges
are taken to occur in a bi-cluster with equal
probability. Weights as determined by Eq. 8 are
assigned to the edges and non-edges in the graph. A
major strength of this method is that the statistical
significance of any sub-graph is then simply
determined by its weight.
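The weighting scheme of Eq. 8 is straightforward to express; in the sketch below, the function name and example probabilities are our own.

```python
import numpy as np

def samba_weight(p_uv, p_c, edge=True):
    """Log-likelihood weight of Eq. 8 for an edge or non-edge (u, v)."""
    if edge:
        return np.log(p_c / p_uv)              # positive when p_c > P_(u,v)
    return np.log((1.0 - p_c) / (1.0 - p_uv))  # negative when p_c > P_(u,v)

# An edge that is rare under the random-graph model receives a large weight
w_edge = samba_weight(p_uv=0.05, p_c=0.9, edge=True)       # ~ 2.89
w_non_edge = samba_weight(p_uv=0.05, p_c=0.9, edge=False)  # ~ -2.25
```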
Cluster Evaluation and Comparison
Evaluation is not particularly well developed for
clustering analyses applied to gene expression data,
as very little may be known about the dataset
beforehand. Many clustering algorithms are designed
to be exploratory, producing different clusters
according to given classification criteria, and will
discover a structure, meaningful in that context,
which may yet fail to be optimal or even biologically
realistic. For example, for K-Means the best structure
is the one that minimises the sum of squared errors
(MacQueen, 1967) while, for the Cheng and Church
algorithm (Cheng & Church, 2000), it is the one that
minimises the Mean Residue Score (Eq. 7). The two may not
Figure 3. Bipartite graph representing expression
for seven genes under five conditions; edges indicate
a change in expression.
be directly comparable, as the former highlights
global patterns in the data and the latter local
patterns. While larger deviations from the mean may
also correspond to large residue scores, this will
not always be the case. For example, Figure 4
highlights a simple situation with three genes in a
cluster. According to the K-Means criterion, the
within-cluster distance is approximately 11.02, based
on Euclidean distance and the centroid of the cluster;
the Mean Residue Score is 0. Reducing the scale of
profile 1 by one third (Figure 4(b)) decreases the
within-cluster distance to 7.91, while increasing the
Mean Residue Score only slightly, to 0.0168. Under
the two criteria, then, (a) and (b) are roughly
equivalent. Consequently, interpretation of cluster
results relies on some level of subjectivity, as well
as on independent validation and integration of
findings. Subjective evaluation, even for
low-dimensional data, is non-trivial at best, but
becomes increasingly difficult for high-dimensional
gene expression data. Clearly, each technique will
find patterns even if these are not meaningful in a
biological context.
The benefits of the individual techniques, as applied
to gene expression data, were highlighted in the last
section. This section aims at providing navigational
guidelines for some degree of objective evaluation
and comparison.
Determining the Correct Number of
Clusters
Cluster number, in the absence of prior knowledge,
is determined by whether a non-random structure exists
in the data. Limiting the search to a specific number
of groups will bias it, as patterns tend to be
well-defined only for the strongest signals. Commonly,
statistical tests for spatial randomness are used to
test whether a non-random structure exists, but
identification of small, biologically meaningful
clusters remains non-trivial. This is particularly
true of standard methods, which find global structure
but lack the fine-tuning needed to distinguish local
structures. Selection of the correct number of
clusters (K) is thus inherently iterative. A
near-optimal K should clearly maximise heterogeneity
between groups while maximising homogeneity within
groups, but determining the number of significant
clusters relies not only on direct extraction (or
assessment) but also on appropriate hypothesis
testing. Direct methods are based on a variety of
criteria⁴; nevertheless, the improvement in terms of
identification of local clusters is slight. Specific
tests include the Gap Statistic (Tibshirani, Walther,
& Hastie, 2001), Weighted Discrepant Pairs (WADP)
(Bittner, Meltzer, Chen, Jiang, Seftor, Hendrix,
Radmacher, Simon, Yakhini, Ben-Dor, Sampas,
Dougherty, Wang, Marincola, Gooden, Lueders,
Glatfelter, Pollock, Carpten, Gillanders, Leja,
Dietrich, Beaudry, Berens, Alberts, & Sondak,
Figure 4. Gene expression profiles for two equivalent
clusters; the cluster in (b) has profile 1 scaled down
by one third.
2000), and a variety of permutation methods
(Bittner et al., 2000; Fridlyand & Dudoit, 2001).
Since most involve bootstrapping, these methods
can be computationally very expensive. Compari-
son of methods for selecting the number of groups
is discussed by Milligan and Cooper (1985) and,
more recently, by Fridlyand and Dudoit (2001),
who note that no existing tests are optimal for
gene expression data.
Comparing Results from Clustering
Algorithms
Numerical measures of cluster goodness include
cluster cohesion (compactness or tightness), that is,
how closely related the genes in a cluster are, while
measures of cluster separation (isolation) determine
how distinct each cluster is. The Group Homogeneity
Function is often used to measure the association
(distinctiveness) of genes within and between groups.
Comparison with Metadata
Including biological function information in the gene
list for each cluster inevitably provides a more
complete picture of the dataset and of the success of
the technique. This information can be used to
validate the clusters produced, and a number of
functional annotation databases are available. The
Gene Ontology database (Ashburner, Ball, Blake,
Botstein, Butler, Cherry, Davis, Dolinski, Dwight,
Eppig, Harris, Hill, Issel-Tarver, Kasarskis, Lewis,
Matese, Richardson, Ringwald, Rubin, & Sherlock,
2000), for example, provides a structured vocabulary
that describes the role of genes and proteins in all