Optim Eng (2010) 11:145–157
DOI 10.1007/s11081-008-9057-z
Clustering and feature selection using sparse principal component analysis
Ronny Luss · Alexandre d'Aspremont
Received: 30 March 2007 / Accepted: 13 October 2008 / Published online: 6 November 2008
© Springer Science+Business Media, LLC 2008
Abstract  In this paper, we study the application of sparse principal component analysis (PCA) to clustering and feature selection problems. Sparse PCA seeks sparse factors, or linear combinations of the data variables, explaining a maximum amount of variance in the data while having only a limited number of nonzero coefficients. PCA is often used as a simple clustering technique and sparse factors allow us here to interpret the clusters in terms of a reduced set of variables. We begin with a brief introduction and motivation on sparse PCA and detail our implementation of the algorithm in d'Aspremont et al. (SIAM Rev. 49(3):434–448, 2007). We then apply these results to some classic clustering and feature selection problems arising in biology.

Keywords  Sparse principal component analysis · Semidefinite programming · Clustering · Feature selection

1 Introduction

This paper focuses on applications of sparse principal component analysis to clustering and feature selection problems, with a particular focus on gene expression data analysis. Sparse methods have had a significant impact in many areas of statistics, in particular regression and classification (see Candès and Tao 2005; Donoho and Tanner 2005 and Vapnik 1995 among others). As in these areas, our motivation for developing sparse multivariate visualization tools is the potential of these methods for yielding statistical results that are both more interpretable and more robust than classical analysis, while giving up little statistical efficiency.
R. Luss · A. d'Aspremont (✉)
ORFE Department, Princeton University, Princeton, NJ 08544, USA
e-mail: [email protected]

R. Luss
e-mail: [email protected]
Principal component analysis (PCA) is a classic tool for analyzing large scale multivariate data. It seeks linear combinations of the data variables (often called factors or principal components) that capture a maximum amount of variance. Numerically, PCA only amounts to computing a few leading eigenvectors of the data's covariance matrix, so it can be applied to very large scale data sets. One of the key shortcomings of PCA however is that these factors are linear combinations of all variables; that is, all factor coefficients (or loadings) are nonzero. This means that while PCA facilitates model interpretation and visualization by concentrating the information in a few key factors, the factors themselves are still constructed using all observed variables. In many applications of PCA, the coordinate axes have a direct physical interpretation; in finance or biology for example, each axis might correspond to a specific financial asset or gene. In such cases, having only a few nonzero coefficients in the principal components would greatly improve the relevance and interpretability of the factors. In sparse PCA, we seek a trade-off between the two goals of expressive power (explaining most of the variance or information in the data) and interpretability (making sure that the factors involve only a few coordinate axes or variables). When PCA is used as a clustering tool, sparse factors allow us to identify the clusters using the action of only a few variables.

Earlier methods to produce sparse factors include Cadima and Jolliffe (1995), where the loadings with smallest absolute value are thresholded to zero, and nonconvex algorithms such as SCoTLASS by Jolliffe et al. (2003), SLRA by Zhang et al. (2002, 2004) and SPCA by Zou et al. (2006). This last method works by writing PCA as a regression-type optimization problem and applies the LASSO (Tibshirani 1996), a penalization technique based on the l1 norm. Very recently, Moghaddam et al. (2006a, 2006b) also proposed an approach which seeks globally optimal solutions on small problems and uses a greedy method to approximate the solution of larger ones.

In what follows, we give a brief introduction to the relaxation of this problem in d'Aspremont et al. (2007) and describe how this smooth optimization algorithm was implemented. The most expensive numerical step in this algorithm is the computation of the gradient as a matrix exponential, and our key numerical contribution here is to show that using only a partial eigenvalue decomposition of the current iterate can produce a sufficiently precise gradient approximation while drastically improving computational efficiency. We then show on classic gene expression data sets that using sparse PCA as a simple clustering tool isolates very relevant genes compared to other techniques such as recursive feature elimination or ranking.

The paper is organized as follows. In Sect. 2, we begin with a brief introduction and motivation on sparse PCA and detail our implementation of the algorithm in a numerical toolbox called DSPCA, which is available for download on the authors' websites. In Sect. 3, we describe the application of sparse PCA to clustering and feature selection on gene expression data.
2 Algorithm and implementation

In this section, we begin by introducing the sparse PCA problem and the corresponding semidefinite relaxation derived in d'Aspremont et al. (2007). We then discuss
how to use a partial eigenvalue decomposition of the current iterate to produce a sufficiently precise gradient approximation and improve computational efficiency.

2.1 Sparse PCA and semidefinite relaxation

Given a covariance matrix A ∈ S^n, where S^n is the set of symmetric matrices of dimension n, the problem of finding a sparse factor which explains a maximum amount of variance in the data can be written:

    maximize    x^T A x
    subject to  ‖x‖_2 = 1,  Card(x) ≤ k,                                    (1)
in the variable x ∈ R^n, where Card(x) denotes the cardinality of x and k > 0 is a parameter controlling this cardinality. Computing sparse factors with maximum variance is a combinatorial problem and is numerically hard, and d'Aspremont et al. (2007) use semidefinite relaxation techniques to compute approximate solutions efficiently by solving the following convex program:

    maximize    Tr(AX)
    subject to  Tr(X) = 1,  1^T |X| 1 ≤ k,  X ⪰ 0,                          (2)
which is a semidefinite program in the variable X ∈ S^n, where 1^T |X| 1 = Σ_{ij} |X_{ij}| can be seen as a convex lower bound on the function Card(X). When the solution of the above problem has rank one, we have X = x x^T, where x is an approximate solution to (1). When the solution of this relaxation is not rank one, we use the leading eigenvector of X as a principal component.

While small instances of problem (2) can be solved efficiently using interior point semidefinite programming solvers such as SEDUMI by Sturm (1999), the O(n^2) linear constraints make these solvers inefficient for reasonably large instances. Furthermore, interior point methods are geared towards solving small problems with high precision requirements, while here we need to solve very large instances with relatively low precision. In d'Aspremont et al. (2007) it was shown that a smoothing technique due to Nesterov (2005) could be applied to problem (2) to get a complexity estimate of O(n^4 √(log n)/ε) and a much lower memory cost per iteration. The key numerical step in this algorithm is the computation of a smooth approximation of problem (2) and the gradient of the objective, which amounts to computing a matrix exponential.
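For very small instances, relaxation (2) can be checked directly with an off-the-shelf modeling tool. The sketch below is only an illustration and is not part of the DSPCA toolbox; it assumes Python with CVXPY and NumPy installed, and the helper name sparse_pca_relaxation is hypothetical.

```python
import numpy as np
import cvxpy as cp

def sparse_pca_relaxation(A, k):
    """Solve relaxation (2) for a small covariance matrix A and return
    the leading eigenvector of the solution X as the sparse factor."""
    n = A.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, cp.trace(X) == 1, cp.sum(cp.abs(X)) <= k]
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()
    return np.linalg.eigh(X.value)[1][:, -1]

# Example on a random 10 x 10 covariance matrix with cardinality target k = 3.
rng = np.random.default_rng(0)
B = rng.standard_normal((30, 10))
x = sparse_pca_relaxation(B.T @ B / 30, k=3)
```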
2.2 Implementation

Again, given a covariance matrix A ∈ S^n, the DSPCA code solves a penalized formulation of problem (2), written as:

    maximize    Tr(AX) − ρ 1^T |X| 1
    subject to  Tr(X) = 1,  X ⪰ 0,                                          (3)

in the variable X ∈ S^n, where the penalty parameter ρ > 0 controls sparsity. The dual of this program can be written as:

    minimize    f(U) = λ_max(A + U)
    subject to  |U_{ij}| ≤ ρ,                                               (4)
in the variable U ∈ S^n. The algorithm in d'Aspremont et al. (2007) regularizes the objective f(U) in (4), replacing it by the smooth (i.e. with Lipschitz continuous gradient) uniform approximation:

    f_μ(U) = μ log(Tr exp((A + U)/μ)) − μ log n,

whose gradient can be computed explicitly as:

    ∇f_μ(U) := exp((A + U)/μ) / Tr(exp((A + U)/μ)).

Following Nesterov (1983), solving the smooth problem:

    min_{U ∈ S^n, |U_{ij}| ≤ ρ}  f_μ(U)

with μ = ε/(2 log n) then produces an ε-approximate solution to (3) and requires

    O( ρ n √(log n) / ε )                                                   (5)
iterations. The main step at each iteration is computing the matrix exponential exp((A + U)/μ). This is a classic problem in linear algebra (see Moler and Van Loan 2003 for a comprehensive survey) and in what follows, we detail three different methods implemented in DSPCA and their relative performance.

Full eigenvalue decomposition  An exact computation of the matrix exponential can be done through a full eigenvalue decomposition of the matrix. Given the spectral decomposition V D V^T of a matrix A, where the columns of V are the eigenvectors and D is a diagonal matrix comprised of the eigenvalues (d_i)_{i=1}^n, the matrix exponential can be computed as:

    exp(A) = V H V^T,

where H is the diagonal matrix with (h_i = e^{d_i})_{i=1}^n on its diagonal. While this is a simple procedure, it is also relatively inefficient.
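As an illustration of the procedure above, here is a minimal NumPy sketch (not taken from the DSPCA code) that evaluates the smooth objective f_μ and its gradient through a full symmetric eigenvalue decomposition; the eigenvalue shift is an added safeguard against overflow and cancels in the normalization.

```python
import numpy as np

def f_smooth(A, U, mu):
    """Smooth approximation f_mu(U) = mu*log(Tr exp((A+U)/mu)) - mu*log(n)."""
    n = A.shape[0]
    d = np.linalg.eigvalsh((A + U) / mu)
    # log-sum-exp of the eigenvalues, computed stably
    return mu * (d.max() + np.log(np.exp(d - d.max()).sum())) - mu * np.log(n)

def grad_smooth(A, U, mu):
    """Gradient exp((A+U)/mu) / Tr exp((A+U)/mu) via full eigendecomposition."""
    d, V = np.linalg.eigh((A + U) / mu)
    h = np.exp(d - d.max())       # shift by the largest eigenvalue to avoid
    E = (V * h) @ V.T             # overflow; the shift cancels in the ratio below
    return E / h.sum()            # Tr(V diag(h) V^T) = sum(h)
```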
Padé approximation  The next method implemented in DSPCA is called Padé approximation and approximates the exponential by a rational function. The (p, q) Padé approximation for exp(A) is defined by (see Moler and Van Loan 2003):

    R_{pq}(A) = [D_{pq}(A)]^{-1} N_{pq}(A),                                 (6)

where

    N_{pq}(A) = Σ_{j=0}^{p} [(p + q − j)! p! / ((p + q)! j! (p − j)!)] A^j

and

    D_{pq}(A) = Σ_{j=0}^{q} [(p + q − j)! q! / ((p + q)! j! (q − j)!)] (−A)^j.
Here p and q control the degree and precision of the approximation and we set p = q = 6 (we set p = q in practice due to computational issues; see Moler and Van Loan 2003). The approximation is only valid in a small neighborhood of zero, which means that we need to scale down the matrix before approximating its exponential using (6), and then scale it back to its original size. This scaling and squaring can be done efficiently using the property that e^A = (e^{A/m})^m. We first scale the matrix A so that ‖(1/m)A‖ ≤ 10^{−6} and find the smallest integer s such that this is true for m = 2^s. We then use the Padé approximation to compute e^{A/m}, and simply square the result s times to scale it back. Padé approximation only requires computing one matrix inversion and several matrix products, which can be done very efficiently. However, if n or s gets somewhat large, scaling and squaring can be costly, in which case a full eigenvalue decomposition has better performance. While Padé approximations appear to be the current method of choice for computing matrix exponentials (see Moler and Van Loan 2003), they do not perform as well as expected on our problems compared to the partial eigenvalue decomposition discussed below, because of the particular structure of our optimization problem. Numerical results illustrating this issue are detailed in the last section.
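For reference, a general-purpose scaling-and-squaring Padé routine such as SciPy's scipy.linalg.expm can play the same role. The short sketch below is an illustration rather than the DSPCA implementation, and the spectral shift c is only a cheap upper bound added here to avoid overflow.

```python
import numpy as np
from scipy.linalg import expm   # scaling-and-squaring with Pade approximants

def grad_via_expm(A, U, mu):
    """Gradient exp((A+U)/mu) / Tr exp((A+U)/mu) via a library matrix exponential."""
    M = (A + U) / mu
    # Shift by an upper bound on the largest eigenvalue so that expm does not
    # overflow; the scalar factor exp(-c) cancels when dividing by the trace.
    c = np.abs(M).sum(axis=1).max()          # infinity-norm bound on the spectrum
    E = expm(M - c * np.eye(M.shape[0]))
    return E / np.trace(E)
```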
Partial eigenvalue decomposition  The first two classic methods we described for computing the exponential of a matrix are both geared towards producing a solution up to (or close to) machine precision. In most of the sparse PCA problems we solve here, the target precision for the algorithm is of the order 10^{−1}. Computing the gradient up to machine precision in this context is unnecessarily costly. In fact, d'Aspremont (2005) shows that the optimal convergence of the algorithm in Nesterov (2005) can be achieved using approximate gradient values ∇̃f_μ(U), provided the error satisfies the following condition:

    |⟨∇̃f_μ(U) − ∇f_μ(U), Y⟩| ≤ δ,    |U_{ij}| ≤ ρ, |Y_{ij}| ≤ ρ, i, j = 1, . . . , n,      (7)

where δ is a parameter controlling the approximation error. In practice, this means that we only need to compute a few leading eigenvalues of the matrix (A + U)/μ to get a sufficient gradient approximation. More precisely, if we denote by λ ∈ R^n the eigenvalues of (A + U)/μ, sorted in decreasing order, condition (7) can be used to show that, to get convergence, we need only compute the j largest eigenvalues, with j such that:

    (n − j) e^{λ_j} ( Σ_{i=1}^{j} e^{2λ_i} + (n − j) e^{λ_j} Σ_{i=1}^{j} e^{λ_i} ) / ( Σ_{i=1}^{j} e^{λ_i} )^2 ≤ δ/(nρ).      (8)

The terms on the left-hand side decrease as j increases, meaning the condition is satisfied by increasing the number of eigenvalues used. Computing the j largest eigenvalues of a matrix can be done very efficiently using packages such as ARPACK when j ≪ n. When j becomes large, the algorithm switches to a full eigenvalue decomposition. Finally, the leading eigenvalues tend to coalesce close to the optimum, potentially increasing the number of eigenvalues required at each iteration (see Pataki 1998 for example), but this phenomenon does not seem to appear at the low precision targets considered here. We detail empirical results on the performance of this technique in the following sections, illustrating how it clearly dominates the other two methods for large scale problems.
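A minimal sketch of this strategy is given below, assuming SciPy's ARPACK wrapper scipy.sparse.linalg.eigsh; note that the stopping test used here is a deliberately simplified stand-in for condition (8), chosen only to illustrate the mechanism of growing j until the omitted part of the spectrum is negligible.

```python
import numpy as np
from scipy.sparse.linalg import eigsh   # ARPACK wrapper

def approx_grad(A, U, mu, j0=10, tol=1e-2):
    """Approximate exp((A+U)/mu) / Tr exp((A+U)/mu) from the j largest
    eigenpairs only, growing j until the omitted spectrum is negligible.
    The stopping test is a simplified stand-in for condition (8)."""
    M = (A + U) / mu
    n = M.shape[0]
    j = min(j0, n - 1)
    while True:
        d, V = eigsh(M, k=j, which="LA")         # j largest eigenvalues of M
        h = np.exp(d - d.max())                  # stable exponentials
        # The n - j omitted eigenvalues are all <= d.min(), so their total
        # weight is at most (n - j) * exp(d.min() - d.max()).
        tail = (n - j) * np.exp(d.min() - d.max())
        if tail <= tol * h.sum() or j == n - 1:
            return (V * h) @ V.T / h.sum()
        j = min(2 * j, n - 1)
```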
Table 1  CPU time (in seconds) and explained variance for Sedumi and smooth optimization with partial eigenvalue decompositions (DSPCA). Var. measures the percentage of total variance explained by the sparse factors. DSPCA significantly outperforms in terms of CPU time while reaching comparable variance targets

Dim.   Nonzeros   DSPCA time   DSPCA var.   Sedumi time   Sedumi var.   PCA var.
10     3          0.08         42.88%       0.39          43.00%        49.36%
20     8          0.27         45.67%       5.29          47.50%        52.58%
30     11         0.99         40.74%       41.99         42.44%        51.13%
40     14         4.55         40.26%       229.44        41.89%        52.40%
50     15         7.52         36.76%       903.36        38.53%        50.24%
2.3 Comparison with interior point solvers

We give an example illustrating the necessity of a large-scale solver using a smooth approximation to problem (2). The Sedumi implementation of problem (2) runs out of memory for problem sizes larger than 60, so we compare the example on very small dimensions against our implementation with partial eigenvalue decomposition (DSPCA). The covariance matrix is formed using the colon cancer gene expression data detailed in the following section. Table 1 shows running times for DSPCA and Sedumi for various (small) problem dimensions. DSPCA clearly beats the interior point solver in computational time while achieving comparable precision (measured as the percentage of variance explained by the sparse factor). For reference, we show how much variation is explained by the leading principal component. The decrease in variance using Sedumi and DSPCA represents the cost of sparsity here.
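Putting the pieces of Sect. 2.2 together, the following sketch illustrates one possible first-order loop on the smoothed dual (4); it uses a plain projected gradient step instead of the accelerated scheme of Nesterov (1983) actually used by DSPCA, so it should be read as a toy illustration under those assumptions rather than the released code. The matrix X read off from the final gradient plays the role of the primal candidate for problem (3).

```python
import numpy as np

def smooth_gradient(M):
    """exp(M) / Tr exp(M) for symmetric M, via eigendecomposition (Sect. 2.2)."""
    d, V = np.linalg.eigh(M)
    h = np.exp(d - d.max())                    # shift for numerical stability
    return (V * h) @ V.T / h.sum()

def sparse_factor(A, rho, eps=1e-1, iters=500):
    """Projected gradient on the smoothed dual (4) over the box |U_ij| <= rho,
    then recovery of a primal candidate X and of its leading eigenvector."""
    n = A.shape[0]
    mu = eps / (2.0 * np.log(n))               # smoothing parameter from Sect. 2.2
    U = np.zeros((n, n))
    for _ in range(iters):
        G = smooth_gradient((A + U) / mu)      # gradient of f_mu at U
        U = np.clip(U - mu * G, -rho, rho)     # step 1/L = mu, then box projection
    X = smooth_gradient((A + U) / mu)          # primal candidate for problem (3)
    return np.linalg.eigh(X)[1][:, -1]         # leading eigenvector as the factor
```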
3 Clustering and feature selection

In this section, we use our code for sparse PCA (DSPCA) to analyze large sets of gene expression data and we discuss applications of this technique to clustering and feature selection. PCA is very often used as a simple tool for data visualization and clustering (see Srebro et al. 2006 for a recent analysis); here, sparse factors allow us to interpret the low dimensional projection of the data in terms of only a few variables.
Fig. 1  Performance of smooth optimization using exact gradient, Padé approximations and partial eigenvalue decomposition. CPU time versus problem dimension (left). Average percentage of eigenvalues required versus problem dimension (right) for partial eigenvalue decomposition, with dashed lines at plus and minus one standard deviation
3.1 Gene expression data

We first test the performance of DSPCA on covariance matrices generated from a gene expression data set from Alon et al. (1999) on colon cancer. The data set is comprised of 62 tissue samples, 22 from normal colon tissues and 40 from cancerous colon tissues, with 2000 genes measured in each sample. We preprocess the data by taking the log of all data intensities and then normalizing the log data of each sample to be mean zero and standard deviation one, which is a classic procedure to minimize experimental effects (see Huang and Kecman 2005).
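As a concrete illustration of this preprocessing step, a minimal NumPy sketch is given below. It assumes the expression matrix is stored with one sample per row; the commented covariance line is only one possible way of forming the input to DSPCA, and the reduction from 2000 measured genes to the n = 1000 used below is not reproduced here.

```python
import numpy as np

def preprocess(expression):
    """Log-transform the (positive) intensities and standardize each sample
    (row) to mean zero and standard deviation one."""
    logged = np.log(expression)
    centered = logged - logged.mean(axis=1, keepdims=True)
    return centered / centered.std(axis=1, keepdims=True)

# samples: 62 x 2000 array of raw expression intensities (one sample per row).
# One possible way of forming the covariance matrix fed to DSPCA:
# A = np.cov(preprocess(samples), rowvar=False)   # gene-by-gene covariance
```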
Fig. 2  Percentage of eigenvalues required versus CPU time (left), and percentage of eigenvalues required versus duality gap (right). The computational cost of an iteration increases with both target precision and target sparsity

Since the semidefinite programs produced by these data sets are relatively large (with n = 1000 after preprocessing), we first test the numerical performance of the algorithm and run the code on increasingly large problems using each of the three methods described in Sect. 2.2 (full eigenvalue decomposition, Padé approximation and partial eigenvalue decomposition). Theoretically, the performance increase of using partial, rather than full, eigenvalue decompositions should be substantial when only a few eigenvalues are required. In practice there is overhead due to the necessity of testing condition (8) iteratively. Figure 1 depicts the results of these tests on a 3.0 GHz CPU in a log-log plot of runtime (in seconds) versus problem dimension (on the left). We plot the average number of eigenvalues required by condition (8) versus problem dimension (on the right), with dashed lines at plus and minus one standard deviation. We cannot include interior point algorithms in this comparison because memory problems occur for dimensions greater than 50. In these tests, partial eigenvalue decomposition is more than three times faster than the other methods for large problem sizes, meaning the approximate gradient clearly dominates the exact gradient. The plot of the average percentage of eigenvalues for each test shows that on average only 23% of the eigenvalues are necessary. Given the performance gain, the main limitation becomes memory; the partial eigenvalue implementation has memory allocation problems at dimensions 1900 and larger on a computer with 3.62 GB of RAM. Notice also that in our experiments, computing the matrix exponential using Padé approximation is slower than performing
a full eigenvalue decomposition, which reflects the high numerical cost of the matrix multiplications necessary for scaling and squaring in Padé approximations.

Next, we test the impact of the sparsity target on computational complexity. We observe in (5) that the number of iterations required by the algorithm scales linearly with ρ. Furthermore, as ρ increases, condition (8) means that more eigenvalues are required to prove convergence, which increases the computational effort. Figure 2 shows how computing time (left) and duality gap (right) vary with the number of eigenvalues required at each iteration (shown as the percentage of total eigenvalues). The four tests are all of dimension 1000 and vary in degrees of sparsity; for a fixed data set at a fixed point in the algorithm, more sparsity (i.e. higher ρ and fewer genes) requires more eigenvalues, as noted above. The increase in required eigenvalues is much more pronounced when viewed as the algorithm proceeds, measured by the increasing computing time (left) and decreasing duality gap (right). More eigenvalues are also required as the current iterate gets closer to the optimal solution. Note that, to reduce overhead, the code does not allow the number of required eigenvalues to decrease from one iteration to the next. Figure 2 shows, as expected, that the computational cost of an iteration increases with both target precision and target sparsity.

3.2 Clustering

In this section, we compare the clustering (class discovery) performance of sparse PCA to that of standard PCA. We analyze the colon cancer data set of Alon et al. (1999) as well as a lymphoma data set from Alizadeh et al. (2000). The lymphoma data set consists of 3 classes of lymphoma denoted DLCL (diffuse large B-cell lymphoma), FL (follicular lymphoma) and CL (chronic lymphocytic leukemia). The top two plots of Fig. 3 display the clustering effects of using two factors on the colon cancer data while the bottom two plots of Fig. 3 display the results on the lymphoma data. On both data sets, clusters are represented using PCA factors on the left plots and sparse factors from DSPCA on the right plots. For the colon cancer data, the second factor has greater power in predicting the class of each sample, while for the
Fig. 3 (Color online) Clustering: The top two graphs display the results on the colon cancer data set using PCA (left) and DSPCA (right). Normal patients are red circles and cancer patients are blue diamonds. The bottom two graphs display the results on the lymphoma data set using PCA (left) and DSPCA (right). For lymphoma, we denote diffuse large B-cell lymphoma as DLCL (red circles), follicular lymphoma as FL (blue diamonds), and chronic lymphocytic leukemia as CL (green squares)
lymphoma data, the first factor classifies DLCL and the second factor differentiates between CL and FL. In these examples, we observe that DSPCA maintains good cluster separation while requiring far fewer genes: a total of 13 instead of 1000 for the colon cancer data, and 108 genes instead of 1000 for the lymphoma data set.

While the clustering remains visually clear, we now analyze and quantify the quality of the clusters derived from PCA and DSPCA numerically using the Rand index. We first cluster the data (after reducing to two dimensions) using K-means clustering, and then use the Rand index to compare the partitions obtained from PCA and DSPCA to the true partitions. The Rand index measures the similarity between two partitions X and Y and is computed as the ratio

    R(X, Y) = (p + q) / C(n, 2),

where p is the number of pairs of elements that are in the same cluster in both partitions X and Y, q is the number of pairs of elements that are in different clusters in both X and Y (both types of pairs count as agreements between the partitions), and C(n, 2) = n(n − 1)/2 is the total number of element pairs.
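The Rand index is straightforward to compute from this definition; the sketch below is an illustrative implementation (the variable names Z and truth, and the use of scikit-learn's KMeans for the two-dimensional clustering step, are assumptions rather than part of the paper's code).

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """R(X, Y) = (p + q) / C(n, 2), following the definition above."""
    pairs = list(combinations(range(len(labels_x)), 2))
    agree = sum(
        (labels_x[i] == labels_x[j]) == (labels_y[i] == labels_y[j])
        for i, j in pairs
    )   # pairs grouped together in both partitions (p) or apart in both (q)
    return agree / len(pairs)

# Example usage on the 2-D projection Z (n_samples x 2) with known labels `truth`:
# from sklearn.cluster import KMeans
# pred = KMeans(n_clusters=2, n_init=10).fit_predict(Z)
# print(rand_index(pred, truth))
```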
Fig. 4  Rand index versus sparsity: colon cancer (left) and lymphoma (right)

Fig. 5  Separation versus sparsity for the lymphoma data set
The Rand index for varying levels of sparsity is plotted in Fig. 4. The Rand index of standard PCA is .654 for colon cancer (.804 for lymphoma), as marked in Fig. 4. The Rand index for the DSPCA factors of colon cancer using 13 genes is .669 and is the leftmost point above the PCA line. This shows that clusters derived using sparse factors can achieve equivalent performance to clusters derived from dense factors. However, DSPCA on the lymphoma data does not achieve a high Rand index with very sparse factors and it takes about 50 genes per factor to get good clusters.

For lymphoma, we can also look at another measure of cluster validity. We measure the impact of sparsity on the separation between the true clusters, defined as the distance between the cluster centers. Figure 5 shows how this separation varies with the sparsity of the factors. The lymphoma clusters with 108 genes have a separation of 63, after which separation drops sharply. Notice that the separation of CL and FL is very small to begin with and the sharp decrease in separation is mostly due to CL and FL getting closer to DLCL.

3.3 Feature selection

Using the sparse factor analysis derived above, our next objective is to track the action of the selected genes. See Guyon et al. (2002) for an example illustrating the
possible relation of selected genes in this data set to colon cancer using Recursive Feature Elimination with Support Vector Machines (RFE-SVM) for feature selection. Another study in Huang and Kecman (2005) compares RFE-SVM results on the colon cancer data set to the genes selected by the Rankgene software of Su et al. (2003) on the colon cancer and lymphoma data sets. We implemented RFE-SVM and determined the regularization parameter using cross-validation.

On the colon cancer data, the clustering results show that factor two contains the most predictive power. Table 2 shows 7 genes that appeared in the top 10 most important genes (ranked by magnitude) in the results of both DSPCA and RFE-SVM. For comparison, we include ranks computed using another sparse PCA package: SPCA from Zou et al. (2006), with a sparsity level equal to the results of DSPCA (we use the function SPCA for colon cancer and arrayspca for lymphoma, both from Zou and Hastie 2005).
Table 2  Feature selection on colon cancer data set. Columns 1–3 are respective ranks. 7 genes (out of 2000) were in the top 10 list of genes for both DSPCA (factor two) and RFE-SVM (with C = .005). 5 of the genes were top genes in SPCA

DSPCA   RFE-SVM   SPCA   GAN      Description
2       1         11     J02854   Myosin regulatory light chain 2, smooth muscle isoform (human)
4       3         9      X86693   H. sapiens mRNA for hevin like protein
5       6         1      T92451   Tropomyosin, fibroblast and epithelial muscle-type (human)
7       4         5      H43887   Complement factor D precursor (H. sapiens)
8       9         NA     H06524   Gelsolin precursor, plasma (human)
9       2         NA     M63391   Human desmin gene, complete cds
10      7         6      T47377   S-100P Protein (human)
Table 3  Feature selection on lymphoma data set. Columns 1–3 are respective ranks. 5 genes (out of 4026) were in the top 15 list of genes for all three of DSPCA (factor one), RFE-SVM (classifying DLCL or not DLCL with C = .001), and SPCA (factor one). DSPCA and SPCA have an additional 2 top genes not found in the top 15 genes of RFE-SVM

DSPCA   RFE-SVM   SPCA   GAN         Description
1       1         1      GENE1636X   *Fibronectin 1; Clone = 139009
2       NA        6      GENE1637X   Cyclin D2/KIAK0002 = overlaps with middle of KIAK0002 cDNA; Clone = 359412
3       14        2      GENE1641X   *Fibronectin 1; Clone = 139009
4       NA        9      GENE1638X   MMP-2 = Matrix metalloproteinase 2 = 72 kD type IV collagenase precursor = 72 kD gelatinase = gelatinase A = TBE1; Clone = 323656
6       2         3      GENE1610X   *Mig = Humig = chemokine targeting T cells; Clone = 8
10      3         4      GENE1648X   *cathepsin B; Clone = 297219
15      8         8      GENE1647X   *cathepsin B; Clone = 261517
DSPCA identifies 2 genes from the top 10 genes of RFE-SVM that are not identified by SPCA. After removing the true first component, which explains 58.1% of the variation, DSPCA, SPCA, and PCA explain 1.4%, 1.1%, and 7.2% of the remaining variation, respectively.

Since the first factor of the lymphoma DSPCA results classifies DLCL, we combine FL and CL into a single group for RFE-SVM and compare the results to factor one of DSPCA. Table 3 shows 5 genes that appeared in the top 15 most important genes of all three feature selection methods. An additional 2 genes were discovered by DSPCA and SPCA but not by RFE-SVM. DSPCA, SPCA, and PCA explain 13.5%, 9.7%, and 28.5% of the variation, respectively, providing another example where DSPCA maintains more of the clustering effects than SPCA given equal sparsity. Both RFE-SVM and SPCA are faster than DSPCA, but RFE-SVM does not produce factors in the sense of PCA, and SPCA is nonconvex and has no convergence guarantees.
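For readers who want to reproduce the RFE-SVM baseline, the following sketch shows one way to set it up with scikit-learn; the implementation used above is not based on this library, and the helper name rfe_svm_ranking and the elimination step size are illustrative assumptions (only the values C = .005 and C = .001 come from Tables 2 and 3).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

def rfe_svm_ranking(X, y, C=0.005):
    """Full gene ranking by recursive feature elimination on a linear SVM:
    the gene with ranking_ == 1 is the last one eliminated (most important)."""
    svm = SVC(kernel="linear", C=C)
    selector = RFE(estimator=svm, n_features_to_select=1, step=1)
    selector.fit(X, y)
    return selector.ranking_                   # one rank per gene, 1 = best

# X: preprocessed expression matrix (samples x genes), y: 0/1 class labels.
# ranks = rfe_svm_ranking(X, y, C=0.005)       # C = .005 as quoted in Table 2
# top10 = np.argsort(ranks)[:10]               # indices of the ten top-ranked genes
```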
4 Conclusion

We showed that efficient approximations of the gradient allowed large scale instances of the sparse PCA relaxation algorithm in d'Aspremont et al. (2007) to be solved at a moderate computational cost. This allowed us to apply sparse PCA to clustering and feature selection on two classic gene expression data sets. In both cases, sparse PCA efficiently isolates relevant genes while maintaining most of the original clustering power of PCA.
Acknowledgements We are very grateful to the organizers of the BIRS workshop on Optimization and Engineering Applications where this work was presented. The authors would also like to acknowledge support from grants NSF DMS-0625352, Eurocontrol C20083E/BM/05 and a gift from Google, Inc.
References
Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:503–511
Alon A, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Cell Biol 96:6745–6750
Cadima J, Jolliffe IT (1995) Loadings and correlations in the interpretation of principal components. J Appl Stat 22:203–214
Candès EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inf Theory 51(12):4203–4215
d'Aspremont A (2005) Smooth optimization with approximate gradient. ArXiv:math.OC/0512344
d'Aspremont A, El Ghaoui L, Jordan MI, Lanckriet GRG (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev 49(3):434–448
Donoho DL, Tanner J (2005) Sparse nonnegative solutions of underdetermined linear equations by linear programming. Proc Natl Acad Sci 102(27):9446–9451
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422
Huang TM, Kecman V (2005) Gene extraction for cancer diagnosis by support vector machines-an improvement. Artif Intell Med 35:185–194
Jolliffe IT, Trendafilov NT, Uddin M (2003) A modified principal component technique based on the LASSO. J Comput Graph Stat 12:531–547
Moler C, Van Loan C (2003) Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev 45(1):3–49
Moghaddam B, Weiss Y, Avidan S (2006a) Generalized spectral bounds for sparse LDA. In: International conference on machine learning
Moghaddam B, Weiss Y, Avidan S (2006b) Spectral bounds for sparse PCA: exact and greedy algorithms. Adv Neural Inf Process Syst 18
Nesterov Y (1983) A method of solving a convex programming problem with convergence rate O(1/k^2). Sov Math Dokl 27(2):372–376
Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103(1):127–152
Pataki G (1998) On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Math Oper Res 23(2):339–358
Su Y, Murali TM, Pavlovic V, Schaffer M, Kasif S (2003) Rankgene: identification of diagnostic genes based on expression data. Bioinformatics 19:1578–1579
Srebro N, Shakhnarovich G, Roweis S (2006) An investigation of computational and informational limits in Gaussian mixture clustering. In: Proceedings of the 23rd international conference on machine learning, pp 865–872
Sturm J (1999) Using SEDUMI 1.0x, a MATLAB toolbox for optimization over symmetric cones. Optim Methods Softw 11:625–653
Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58(1):267–288
Vapnik V (1995) The nature of statistical learning theory. Springer, Berlin
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67(2):301–320
Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Stat 15(2):265–286
Zhang Z, Zha H, Simon H (2002) Low rank approximations with sparse factors I: basic algorithms and error analysis. SIAM J Matrix Anal Appl 23(3):706–727
Zhang Z, Zha H, Simon H (2004) Low rank approximations with sparse factors II: penalized methods with discrete Newton-like iterations. SIAM J Matrix Anal Appl 25(4):901–920