Spectral Kernel Methods for Clustering

Nello Cristianini    John Shawe-Taylor    Jaz Kandola


BIOwulf Technologies Royal Holloway, University of London
[email protected]    {john, jaz}@cs.rhul.ac.uk

Abstract
In this paper we introduce new algorithms for unsupervised learning based on the use of a kernel matrix. All the information required by such algorithms is contained in the eigenvectors of the matrix or of closely related matrices. We use two different but related cost functions, the Alignment and the 'cut cost'. The first one is discussed in a companion paper [3], the second one is based on graph theoretic concepts. Both functions measure the level of clustering of a labeled dataset, or the correlation between data clusters and labels. We state the problem of unsupervised learning as assigning labels so as to optimize these cost functions. We show how the optimal solution can be approximated by slightly relaxing the corresponding optimization problem, and how this corresponds to using eigenvector information. The resulting simple algorithms are tested on real world data with positive results.

1 Introduction
Kernel based learning provides a modular approach to learning system design [2]. A
general algorithm can be selected for the appropriate task before being mapped onto
a particular application through the choice of a problem specific kernel function.
The kernel based method works by mapping data to a high dimensional feature
space implicitly defined by the choice of the kernel function. The kernel function
computes the inner product of the images of two inputs in the feature space. From
a practitioner's viewpoint this function can also be regarded as a similarity measure
and hence provides a natural way of incorporating domain knowledge about the
problem into the bias of the system.
One important learning problem is that of dividing the data into classes according
to a cost function based on their relative positions in the feature space. We
can think of this as clustering in the kernel defined feature space, or non-linear
clustering in the input space.
In this paper we introduce two novel kernel-based methods for clustering. They both assume that a kernel has been chosen and the kernel matrix constructed. The methods then make use of the matrix's eigenvectors, or of the eigenvectors of the closely related Laplacian matrix, in order to infer a label assignment that approximately optimizes one of two cost functions. See also [4] for use of spectral decompositions of the kernel matrix. The paper includes some analysis of the algorithms together with tests of the methods on real world data with encouraging results.
2 Two partition cost measures
All the information needed to specify a clustering of a set of data is contained in the matrix $M_{ij} = (\mathrm{cluster}(x_i) == \mathrm{cluster}(x_j))$, where $(A == B) \in \{-1, +1\}$. After a clustering is specified, one can measure its cost in many ways. We propose here two cost functions that are easy to compute and lead to efficient algorithms.
Learning is possible when some collusion between input distribution and target
exists, so that we can predict the target based on the input. Typically one would
expect points with similar labels to be clustered and the clusters to be separated.
This can be detected in two ways: either by measuring the amount of label-clustering or by measuring the correlation between such variables. In the first case, we need to measure how close points of the same class are to each other and how distant they are from points of different classes. In the second case, kernels can be regarded as oracles predicting whether two points are in the same class. The 'true' oracle is the one that knows the true matrix $M$. A measure of quality can be obtained by measuring the Pearson correlation coefficient between the kernel matrix $K$ and the true $M$.
Both approaches lead to the same quantity, known as the alignment [3].
We will use the following definition of the inner product between matrices: $\langle K_1, K_2\rangle_F = \sum_{i,j=1}^m K_1(x_i, x_j) K_2(x_i, x_j)$. The index $F$ refers to the Frobenius norm that corresponds to this inner product.
Definition 1 Alignment. The (empirical) alignment of a kernel $k_1$ with a kernel $k_2$ with respect to the sample $S$ is the quantity
$$\hat{A}(S, k_1, k_2) = \frac{\langle K_1, K_2\rangle_F}{\sqrt{\langle K_1, K_1\rangle_F \langle K_2, K_2\rangle_F}},$$
where $K_i$ is the kernel matrix for the sample $S$ using kernel $k_i$.
This can also be viewed as the cosine of the angle between two bi-dimensional vectors $K_1$ and $K_2$, representing the Gram matrices. If we consider $k_2 = yy'$, where $y$ is the vector of $\{-1, +1\}$ labels for the sample, then with a slight abuse of notation
$$\hat{A}(S, k, y) = \frac{\langle K, yy'\rangle_F}{\sqrt{\langle K, K\rangle_F \langle yy', yy'\rangle_F}} = \frac{\langle K, yy'\rangle_F}{m\|K\|_F}, \qquad \text{since } \langle yy', yy'\rangle_F = m^2.$$
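As a concrete illustration (our own sketch, not code from the paper), both the general alignment and the label-kernel special case can be computed directly from the kernel matrix with NumPy:

```python
import numpy as np

def alignment(K1, K2):
    """Empirical alignment: <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)."""
    num = np.sum(K1 * K2)                                   # Frobenius inner product
    return num / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def label_alignment(K, y):
    """Special case k2 = yy': <K, yy'>_F / (m ||K||_F) = y'Ky / (m ||K||_F)."""
    m = len(y)
    return y @ K @ y / (m * np.linalg.norm(K, 'fro'))
```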
Another measure of separation between classes is the average separation between
two points in different classes, again normalised by the matrix norm.
Definition 2 Cut Cost. The cut cost of a clustering is defined as
$$C(S, k, y) = \frac{\sum_{ij:\, y_i \neq y_j} k(x_i, x_j)}{m\|K\|_F}.$$
This quantity is motivated by a graph theoretic concept. If we consider the Kernel
matrix as the adjacency matrix of a fully connected weighted graph whose nodes
are the data points, the cost of partitioning a graph is given by the total weight of
the edges that one needs to cut or remove, and is exactly the numerator of the 'cut
cost'. Notice also the relation between alignment and cut cost:
$$\hat{A}(S, k, y) = \frac{\sum_{ij} k(x_i, x_j)}{m\sqrt{\langle K, K\rangle_F}} - 2C(S, k, y) = T(S, k) - 2C(S, k, y),$$
where $T(S, k) = \hat{A}(S, k, j)$, for $j$ the all ones vector. Among other appealing properties of the alignment is that this quantity is sharply concentrated around its mean, as proven in the companion paper [3]. This shows that the expected alignment can be reliably estimated from its empirical estimate $\hat{A}(S)$. As the cut cost can be expressed as the difference of two alignments,
$$C(S, k, y) = 0.5\,(T(S, k) - \hat{A}(S, k, y)), \qquad (1)$$
it will be similarly concentrated around its expected value.
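The cut cost is equally direct to compute; the following sketch (ours, under the definitions above) also notes how equation (1) can be checked numerically.

```python
import numpy as np

def cut_cost(K, y):
    """C(S, k, y): total kernel weight between the two classes, normalised by m ||K||_F."""
    m = len(y)
    between = np.not_equal.outer(y, y)          # True exactly where y_i != y_j
    return np.sum(K[between]) / (m * np.linalg.norm(K, 'fro'))

# Equation (1) as a numerical sanity check, reusing label_alignment from the sketch above:
#   j = np.ones(len(y))
#   assert np.isclose(cut_cost(K, y), 0.5 * (label_alignment(K, j) - label_alignment(K, y)))
```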

3 Optimising the cost with spectral techniques


In this section we will introduce and test two related methods for clustering, as
well as their extensions to transduction. The general problem we want to solve is
to assign class-labels to datapoints so as to maximize one of the two cost functions
given above. By equation (1) the optimal solution to both problems is identical for
a fixed data set and kernel. The difference between the approaches is in the two
approximation algorithms developed for the different cost functions. The approxi-
mation algorithms are obtained by relaxing the discrete problems of optimising over
all possible labellings of a dataset to closely related continuous problems solved by
eigenvalue decompositions. See [5] for use of eigenvectors in partitioning sparse
matrices.

3.1 Optimising the alignment


To optimise the alignment, the problem is to find the maximally aligned set of labels
$$\hat{A}^*(S, k) = \max_{y \in \{-1,1\}^m} \hat{A}(S, k, y) = \max_{y \in \{-1,1\}^m} \frac{\langle K, yy'\rangle_F}{m\sqrt{\langle K, K\rangle_F}}.$$
Since in this setting the kernel is fixed, maximising the alignment reduces to choosing $y \in \{-1,1\}^m$ to maximise $\langle K, yy'\rangle_F = y'Ky$. If we allow $y$ to be chosen from the larger set $\mathbb{R}^m$ subject to the constraint $\|y\|^2 = m$, we obtain an approximate maximum-alignment problem that can be solved efficiently. After solving the relaxed problem, we can obtain an approximate discrete solution by choosing a suitable threshold for the entries of the vector $y$ and applying the sign function. Bounds will be given on the quality of the approximations.
The solution of the approximate problem follows from the following theorem that
provides a variational characterization of the spectrum of symmetric matrices.
Theorem 3 (Courant-Fischer Minimax Theorem) If $M \in \mathbb{R}^{m \times m}$ is symmetric, then for $k = 1, \ldots, m$,
$$\lambda_k(M) = \max_{\dim(T)=k}\ \min_{0 \neq v \in T} \frac{v'Mv}{v'v} = \min_{\dim(T)=m-k+1}\ \max_{0 \neq v \in T} \frac{v'Mv}{v'v}.$$
If we consider the first eigenvector, the first min does not apply and we obtain that the approximate alignment problem is solved by the first eigenvector, so that the maximal alignment is upper bounded by a multiple of the first eigenvalue, $\lambda_{\max} = \max_{0 \neq v \in \mathbb{R}^m} \frac{v'Kv}{v'v}$. One can now transform the vector $v$ into a vector in $\{-1, +1\}^m$ by choosing the threshold $\theta$ that gives maximum alignment of $y = \mathrm{sign}(v_{\max} - \theta)$. By definition, the value of alignment $\hat{A}(S, k, y)$ obtained by this $y$ will be a lower bound of the optimal alignment, hence we have
$$\hat{A}(S, k, y) \leq \hat{A}^*(S, k) \leq \lambda_{\max}/\|K\|_F.$$
One can hence estimate the quality of a dichotomy by comparing its value with the
upper bound. The absolute alignment tells us how specialized a kernel is on a given
dataset: the higher this quantity, the more committed to a specific dichotomy.
The first eigenvector can be calculated in many ways, for example by the Lanczos procedure, which is already effective for large datasets. Search engines like Google are based on estimating the first eigenvector of a matrix with dimensionality more than $10^9$, so for very large datasets there are approximation techniques.
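To make the procedure concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code) that computes the top eigenvector of the kernel matrix, scans its entries as candidate thresholds, and returns the sign vector of maximal alignment together with the upper bound $\lambda_{\max}/\|K\|_F$. For very large matrices, an iterative routine such as scipy.sparse.linalg.eigsh could replace the dense decomposition.

```python
import numpy as np

def align_cluster(K):
    """Relaxed alignment maximisation: threshold the eigenvector of the largest
    eigenvalue of K and keep the sign vector with maximal alignment."""
    m = K.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K)         # eigenvalues in ascending order
    v_max = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    best_y, best_a = None, -np.inf
    for theta in np.sort(v_max):                 # candidate thresholds: the entries of v_max
        y = np.where(v_max > theta, 1.0, -1.0)
        a = y @ K @ y / (m * np.linalg.norm(K, 'fro'))     # alignment of this dichotomy
        if a > best_a:
            best_y, best_a = y, a
    upper_bound = eigvals[-1] / np.linalg.norm(K, 'fro')   # lambda_max / ||K||_F
    return best_y, best_a, upper_bound
```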
We applied the procedure outlined above to two datasets from the UCI repository. We preprocessed the data by normalising the input vectors in the kernel-defined feature space and then centering them by shifting the origin (of the feature space) to their centre of gravity. This can be achieved by the following transformation of the kernel matrix, $K \leftarrow K - m^{-1} jg' - m^{-1} gj' + m^{-2} (j'Kj) J$, where $j$ is the all ones vector, $J$ the all ones matrix and $g$ the vector of row sums of $K$.
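A sketch of this preprocessing, assuming a precomputed kernel matrix K (our own code, following the transformation stated above):

```python
import numpy as np

def normalise_kernel(K):
    """Normalise the feature vectors to unit length: K_ij <- K_ij / sqrt(K_ii K_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def centre_kernel(K):
    """Shift the feature-space origin to the centre of gravity:
    K <- K - m^-1 j g' - m^-1 g j' + m^-2 (j'Kj) J, with g the vector of row sums of K."""
    m = K.shape[0]
    g = K.sum(axis=1)                            # g = Kj
    return K - np.add.outer(g, g) / m + g.sum() / m**2
```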

Figure 1: (a) Plot of alignment of the different eigenvectors with the labels, ordered by increasing eigenvalue. (b) Plot for Breast Cancer data (linear kernel) of $\lambda_{\max}/\|K\|_F$ (straight line), $\hat{A}(S, k, y)$ for $y = \mathrm{sign}(v_{\max} - \theta_i)$ (bottom curve), and the accuracy of $y$ (middle curve) against threshold number $i$.
The first experiment applied the unsupervised technique to the Breast Cancer data with a linear kernel. Figure 1(a) shows the alignment of the different eigenvectors with the labels. The highest alignment is shown by the last eigenvector, corresponding to the largest eigenvalue.
For each value $\theta_i$ of the threshold, Figure 1(b) shows the upper bound $\lambda_{\max}/\|K\|_F$ (straight line), the alignment $\hat{A}(S, k, y)$ for $y = \mathrm{sign}(v_{\max} - \theta_i)$ (bottom curve), and the accuracy of $y$ (middle curve). Notice that where the actual alignment and the upper bound on the alignment get closest, we have confidence that we have partitioned our data well, and in fact the accuracy is also maximized. Notice also that the choice of the threshold corresponds to maintaining the correct proportion between positives and negatives. This suggests another possible threshold selection strategy, based on the availability of enough labeled points to give a good estimate of the proportion of positive points in the dataset. This is one way label information can be used to choose the threshold. At the end of the experiments we will describe another 'transduction' method.
It is a measure of how naturally the data separates that this procedure is able to optimise the split with an accuracy of approximately 97.29% by choosing the threshold that maximises the alignment (threshold number 435) but without making any use of the labels.
In Figure 2a we present the same results for the Gaussian kernel ($\sigma = 6$). In this case the accuracy obtained by optimising the alignment (threshold number 316) of the resulting dichotomy is less impressive, being only about 79.65%. Finally, Figure 2b shows the same results for the Ionosphere dataset. Here the accuracy of the split that optimises the alignment (threshold number 158) is approximately 71.37%.

Figure 2: Plot for Breast Cancer data (Gaussian kernel) (a) and Ionosphere data (linear kernel) (b) of $\lambda_{\max}/\|K\|_F$ (straight line), $\hat{A}(S, k, y)$ for $y = \mathrm{sign}(v_{\max} - \theta_i)$ (bottom curve), and the accuracy of $y$ (middle curve) against threshold number $i$.
We can also use the overall approach to adapt the kernel to the data. For example, we can choose the kernel parameters so as to optimize $\lambda_{\max}/\|K\|_F$, then find the first eigenvector, choose a threshold to maximise the alignment, and output the corresponding $y$.
The cost to the alignment of changing a label $y_i$ is $2\sum_j y_j k(x_i, x_j)/\|K\|_F$, so that if a point is isolated from the others, or if it is equally close to the two different classes, then changing its label will have only a very small effect. On the other hand, labels in strongly clustered points clearly contribute to the overall cost and changing their label will alter the alignment significantly.
The method we have described can be viewed as projecting the data into a 1-
dimensional space and finding a threshold. The projection also implicitly sorts the
data so that points of the same class are nearby in the ordering. We discuss the
problem in the 2-class case. We consider embedding the set into the real line, so
as to satisfy a clustering criterion. The resulting Kernel matrix should appear as a
block diagonal matrix.
This problem has been addressed in the case of information retrieval in [1], and
also applied to assembling sequences of DNA. In those cases, the eigenvectors of the
Laplacian have been used, and the approach is called the Fiedler ordering. Although
the Fiedler ordering could be used here as well, we present here a variation based
on the simple kernel matrix.
Let the coordinate of the point $x_i$ on the real line be $v(i)$. Consider the cost function $\sum_{ij} v(i)v(j)K(i,j)$. It is maximized when points with high similarity have the same sign and high absolute value, and when points with different sign have low similarity. The choice of coordinates $v$ that optimizes this cost is the first eigenvector, and hence by sorting the data according to the value of their entry in this eigenvector one can hope to find a good permutation that renders the kernel matrix block diagonal. Figure 3 shows the results of this heuristic applied to the Breast cancer dataset. The grey level indicates the size of the kernel entry. The figure on the left is for the unsorted data, while that on the right shows the same plot after sorting. The sorted figure clearly shows the effectiveness of the method.

Figure 3: Gram matrix for cancer data, before and after permutation of the data according to the sorting order of the first eigenvector of $K$.
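The sorting heuristic amounts to a one-line permutation of the kernel matrix; a possible NumPy sketch (ours, not the authors' code):

```python
import numpy as np

def sort_by_first_eigenvector(K):
    """Permute the kernel matrix so the data are ordered by their entries in the
    eigenvector of the largest eigenvalue, exposing any block-diagonal structure."""
    _, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvecs[:, -1])           # sort by the first eigenvector
    return K[np.ix_(order, order)], order
```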

3.2 Optimising the cut-cost


For a fixed kernel matrix, minimising the cut-cost corresponds to minimising $\sum_{y_i \neq y_j} k(x_i, x_j)$, that is, the sum of the kernel entries between points of two different classes. Since we are dealing with normalized kernels, this also controls the expected distance between them. We can express this quantity as
$$\sum_{y_i \neq y_j} K_{ij} = \frac{1}{2}\Big(\sum_{i,j} K_{ij} - y'Ky\Big) = \frac{1}{2}\, y'Ly,$$
where $L$ is the Laplacian matrix, defined as $L = D - K$, where $D = \mathrm{diag}(d_1, \ldots, d_m)$ with $d_i = \sum_{j=1}^m k(x_i, x_j)$. One would like to find $y \in \{-1, +1\}^m$ so as to minimize the cut cost subject to the division being even, but this problem is NP-hard. Following the same strategy as with the alignment, we can impose a slightly looser constraint on $y$: $y \in \mathbb{R}^m$, $\sum_i y_i^2 = m$, $\sum_i y_i = 0$. This gives the problem
$$\min\ y'Ly \quad \text{subject to}\quad y \in \mathbb{R}^m,\ \sum_i y_i^2 = m,\ \sum_i y_i = 0.$$


Since zero is an eigenvalue of $L$ with eigenvector $j$, the all ones vector, the problem is equivalent to finding the eigenvector of the smallest non-zero eigenvalue $\lambda = \min_{0 \neq y \perp j} y'Ly/(y'y)$. Hence, this eigenvalue $\lambda$ provides a lower bound on the cut cost,
$$\min_{y \in \{-1,1\}^m} C(S, k, y) \geq \frac{\lambda}{2\|K\|_F}.$$
So the eigenvector corresponding to the eigenvalue $\lambda$ of the Laplacian can be used to obtain a good approximate split, and $\lambda$ gives a lower bound on the cut-cost. One can now threshold the entries of the eigenvector in order to obtain a vector with $-1$ and $+1$ entries. We again plot the lower bound, cut-cost, and error rate as a function of the threshold.
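A possible NumPy sketch of this cut-cost procedure (ours, not the authors' implementation), assuming the graph defined by K is connected so that zero is a simple eigenvalue of the Laplacian; the threshold scan is restricted to non-trivial splits:

```python
import numpy as np

def cutcost_cluster(K):
    """Relaxed cut-cost minimisation: threshold the eigenvector of the smallest
    non-zero eigenvalue of the Laplacian L = D - K."""
    m = K.shape[0]
    L = np.diag(K.sum(axis=1)) - K
    eigvals, eigvecs = np.linalg.eigh(L)         # ascending; eigvals[0] is (close to) zero
    lam, v = eigvals[1], eigvecs[:, 1]           # smallest non-zero eigenvalue, assuming connectivity
    best_y, best_c = None, np.inf
    for theta in np.sort(v)[:-1]:                # consider only non-trivial splits
        y = np.where(v > theta, 1.0, -1.0)
        c = np.sum(K[np.not_equal.outer(y, y)]) / (m * np.linalg.norm(K, 'fro'))
        if c < best_c:
            best_y, best_c = y, c
    lower_bound = lam / (2 * np.linalg.norm(K, 'fro'))   # lambda / (2 ||K||_F)
    return best_y, best_c, lower_bound
```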
We applied the procedure to the Breast cancer data with both linear and Gaussian
kernels. The results are shown in Figure 4. Now using the cut cost to select
the best threshold for the linear kernel sets it at 378 with an accuracy of 67.86%,
significantly worse than the results obtained by optimising the alignment. With
the Gaussian kernel, on the other hand, the method selects threshold 312 with an
accuracy of 80.31 %, a slight improvement over the results obtained with this kernel
by optimising the alignment.
So far we have presented algorithms that use unsupervised data. We now consider
the situation where we are given a partially labelled dataset. This leads to a sim-
ple algorithm for transduction or semi-supervised learning. The idea that some
labelled data might improve performance comes from observing Figure 4b, where
the selection based on the cut-cost is clearly suboptimal. By incorporating some
label information, it is hoped that we can obtain an improved threshold selection.
Figure 4: Plot for Breast Cancer data using (a) linear kernel and (b) Gaussian kernel of the cut cost $C(S, k, y)$ and the lower bound $\lambda/(2\|K\|_F)$ (dashed curves), for $y = \mathrm{sign}(v_{\max} - \theta_i)$, and the error of $y$ (solid curve) against threshold number $i$.
Let $z$ be the vector containing the known labels and 0 elsewhere. Set $K^P = K + C_0 zz'$, where $C_0$ is a positive constant parameter. We now use the original matrix $K$ to generate the eigenvector, but the matrix $K^P$ when measuring the cut-cost of the classifications generated by different thresholds. Taking $C_0 = 1$ we performed 5 random selections of 20% of the data and obtained a mean success rate of 85.56% (standard deviation 0.67%) for the Breast cancer data with Gaussian kernel, a marked improvement over the 80.31% achieved with no label information.
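A sketch of how this transductive threshold selection might be implemented, reusing the structure of the cut-cost routine above; the normalisation of the cut cost by $\|K^P\|_F$ is our assumption, since the paper does not spell it out.

```python
import numpy as np

def transductive_threshold(K, z, C0=1.0):
    """Semi-supervised variant: eigenvector from the original K, but thresholds are
    scored by the cut cost of K_p = K + C0 * z z', with z holding the known labels
    (and 0 for unlabelled points)."""
    m = K.shape[0]
    Kp = K + C0 * np.outer(z, z)
    L = np.diag(K.sum(axis=1)) - K
    v = np.linalg.eigh(L)[1][:, 1]               # eigenvector of the unmodified Laplacian
    best_y, best_c = None, np.inf
    for theta in np.sort(v)[:-1]:                # non-trivial splits only
        y = np.where(v > theta, 1.0, -1.0)
        c = np.sum(Kp[np.not_equal.outer(y, y)]) / (m * np.linalg.norm(Kp, 'fro'))  # ||K_p||_F normalisation is our assumption
        if c < best_c:
            best_y, best_c = y, c
    return best_y
```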

4 Conclusions
The paper has considered two partition costs: the first derived from the so-called
alignment of a kernel to a label vector, and the second from the cut-cost of a label
vector for a given kernel matrix. The two quantities are both optimised by the
same labelling, but give rise to different approximation algorithms when the discrete
constraint is removed from the labelling vector. It was shown how these relaxed
problems are solved exactly using spectral techniques, hence leading to two distinct
approximation algorithms through a post-processing phase that re-discretises the
vector to create a labelling that is chosen to optimise the given criterion.
Experiments are presented showing the performance of both of these clustering
techniques with some very striking results. For the second algorithm we also gave
one preliminary experiment with a transductive version that enables some labelled
data to further refine the clustering.

References
[1] M.W. Berry, B. Hendrickson, and P. Raghavan. Sparse matrix reordering schemes for browsing hypertext. In The Mathematics of Numerical Analysis, pages 99-123. AMS, 1996.
[2] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. See also the web site www.support-vector.net.
[3] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandola. On kernel-target alignment. Submitted to Proceedings of Neural Information Processing Systems (NIPS), 2001.
[4] Nello Cristianini, Huma Lodhi, and John Shawe-Taylor. Latent semantic kernels for feature selection. Technical Report NC-TR-00-080, NeuroCOLT Working Group, http://www.neurocolt.org, 2000.
[5] A. Pothen, H. Simon, and K. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal., 11(3):430-452, 1990.
