Package 'PMA': R Topics Documented
R topics documented:
PMA-package
breastdata
CCA
CCA.permute
MultiCCA
MultiCCA.permute
PlotCGH
PMD
PMD.cv
SPC
SPC.cv
Index
PMA-package   Penalized Multivariate Analysis
Description
This package is called PMA, for "Penalized Multivariate Analysis". It implements three methods: a penalized matrix decomposition, sparse principal components analysis, and sparse canonical correlation analysis. All are described in the paper "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis", by D. Witten, R. Tibshirani, and T. Hastie, published in Biostatistics (2009).
The main functions are: (1) PMD, (2) CCA, and (3) SPC. PMD performs a penalized matrix decomposition, CCA performs sparse canonical correlation analysis, and SPC performs sparse principal components analysis.
There are also cross-validation functions for tuning parameter selection for each of these methods: PMD.cv, CCA.permute, and SPC.cv. In addition, PlotCGH produces plots of DNA copy number data.
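As a quick orientation, the sketch below shows the typical workflow (choose a tuning parameter, then fit); the simulated matrix and the candidate grid are illustrative choices, not package defaults.
library(PMA)
set.seed(1)
x <- matrix(rnorm(50*100), nrow=50)          # toy data: 50 samples, 100 features
# Select the sparsity level for SPC by cross-validation, then fit:
cv.out <- SPC.cv(x, sumabsvs=seq(1.2, 5, len=6))
out <- SPC(x, sumabsv=cv.out$bestsumabsv, K=2)
print(out)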
Details
Package: PMA
Type: Package
Version: 1.0.9
Date: 2013-03-23
License: GPL >= 2
Author(s)
Daniela M. Witten, Robert Tibshirani, Sam Gross, and Balasubramanian Narasimhan
References
Witten, Tibshirani and Hastie (2009) A penalized matrix decomposition, with applications to sparse
principal components and canonical correlation analysis. Biostatistics 10(3): 515-534.
breastdata   Breast cancer gene expression + DNA copy number data set from Chin et al (2006), Cancer Cell
Description
This data set consists of gene expression and DNA copy number measurements on a set of 89
samples. This example is used in Witten, Tibshirani and Hastie (2008).
Usage
data(breastdata)
Format
The format is a list containing the following elements:
- dna: a 2149x89 matrix of CGH spots x samples
- rna: a 19672x89 matrix of genes x samples
- chrom: a 2149-vector giving the chromosomal location of each CGH spot
- nuc: a 2149-vector giving the nucleotide position of each CGH spot
- gene: a 19672-vector with an accession number for each gene
- genenames: a 19672-vector with a name for each gene
- genechr: a 19672-vector with a chromosomal location for each gene
- genedesc: a 19672-vector with a description for each gene
- genepos: a 19672-vector with a nucleotide position for each gene
Details
This data set can be used to perform integrative analysis of gene expression and DNA copy number data, as in e.g. Witten, Tibshirani and Hastie (2008). That is, we can look for sets of genes that are associated with regions of chromosomal gain/loss.
Missing values were imputed using 5-nearest neighbors (see library pamr).
Source
This data set was published in the following paper:
Chin, K., DeVries, S., Fridlyand, J., Spellman, P., Roydasgupta, R., Kuo, W.-L., Lapuk, A., Neve, R., Qian, Z., Ryder, T., Chen, F., Feiler, H., Tokuyasu, T., Kingsley, C., Dairkee, S., Meng, Z., Chew, K., Pinkel, D., Jain, A., Ljung, B., Esserman, L., Albertson, D., Waldman, F. & Gray, J. (2006), 'Genomic and transcriptional aberrations linked to breast cancer pathophysiologies', Cancer Cell 10, 529-541.
It is publicly available at http://icbp.lbl.gov/breastcancer.
References
Chin, DeVries, Fridlyand, et al. (2006) Cancer Cell 10, 529-541.
Used as an example in Witten, Tibshirani and Hastie (2008) ’A penalized matrix decomposition,
with applications to sparse principal components and canonical correlation analysis.’
Examples
data(breastdata)
attach(breastdata)
PlotCGH(dna[,1], chrom=chrom, main="Sample 1", nuc=nuc)
detach(breastdata)
CCA   Perform sparse canonical correlation analysis using the penalized matrix decomposition.
Description
Given matrices X and Z, which represent two sets of features on the same set of samples, find sparse
u and v such that u’X’Zv is large. For X and Z, the samples are on the rows and the features are on
the columns. X and Z must have the same number of rows, but may (and usually will) have different
numbers of columns. The columns of X and/or Z can be unordered or ordered. If unordered, then
a lasso penalty will be used to obtain the corresponding canonical vector. If ordered, then a fused
lasso penalty will be used; this will result in smoothness.
Usage
CCA(x, z, typex=c("standard", "ordered"),typez=c("standard","ordered"),
penaltyx=NULL, penaltyz=NULL, K=1, niter=15, v=NULL, trace=TRUE,
standardize=TRUE, xnames=NULL, znames=NULL, chromx=NULL, chromz=NULL,
upos=FALSE, uneg=FALSE, vpos=FALSE, vneg=FALSE, outcome=NULL,
y=NULL, cens=NULL)
## S3 method for class 'CCA'
print(x,verbose=FALSE,...)
Arguments
x Data matrix; samples are rows and columns are features. Cannot contain missing
values.
z Data matrix; samples are rows and columns are features. Cannot contain missing
values.
typex Are the columns of x unordered (type="standard") or ordered (type="ordered")?
If "standard", then a lasso penalty is applied to u, to enforce sparsity. If "ordered"
(generally used for CGH data), then a fused lasso penalty is applied, to enforce
both sparsity and smoothness.
outcome If you would like to incorporate a phenotype into CCA analysis - that is, you wish to find features that are correlated across the two data sets and also correlated with a phenotype - then use one of "survival", "multiclass", or "quantitative" to indicate outcome type. Default is NULL.
y If outcome is not NULL, then this is a vector of phenotypes - one for each row
of x and z. If outcome is "survival" then these are survival times; must be non-
negative. If outcome is "multiclass" then these are class labels (1,2,3,...). Default
NULL.
cens If outcome is "survival" then these are censoring statuses for each observation.
1 is complete, 0 is censored. Default NULL.
... not used.
verbose not used.
Details
This function is useful for performing an integrative analysis of two sets of measurements taken on the same set of samples: for instance, gene expression and CGH measurements on the same set of patients. It takes in two data sets, called x and z, each of which has (the same set of) samples on the rows. If z is a matrix of CGH data with *ordered* CGH spots on the columns, then use typez="ordered". If z consists of unordered columns, then use typez="standard". Similarly for typex.
This function performs the penalized matrix decomposition on the data matrix $X'Z$. Therefore, the results should be the same as running the PMD function on t(x)%*%z. However, when ncol(x) >> nrow(x) and ncol(z) >> nrow(z), using the CCA function is much faster because it avoids explicit computation of $X'Z$.
The CCA criterion is as follows: find unit vectors $u$ and $v$ such that $u'X'Zv$ is maximized subject to constraints on $u$ and $v$. If typex="standard" and typez="standard" then the constraints on $u$ and $v$ are lasso ($L_1$). If typex="ordered" then the constraint on $u$ is a fused lasso penalty (promoting sparsity and smoothness). Similarly if typez="ordered".
When typex is "standard", the L1 bound on u is penaltyx*sqrt(ncol(x)).
When typex is "ordered", penaltyx controls the amount of sparsity and smoothness in u via the fused lasso penalty $\lambda \sum_j |u_j| + \lambda \sum_j |u_j - u_{j-1}|$. If NULL, then it is chosen adaptively from the data.
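As a check of the criterion, the sketch below (a hedged illustration on simulated data, with standardize=FALSE so that the raw x and z can be used in the algebra) verifies that the returned d equals the diagonal of $u'X'Zv$, as stated under Value.
set.seed(1)
x <- matrix(rnorm(40*60), ncol=60)   # 40 samples, 60 features
z <- matrix(rnorm(40*80), ncol=80)   # same 40 samples, 80 features
out <- CCA(x, z, typex="standard", typez="standard", K=2,
           penaltyx=0.3, penaltyz=0.3, standardize=FALSE, trace=FALSE)
# Recompute d by hand as the diagonal of u'X'Zv:
d.byhand <- diag(t(out$u) %*% t(x) %*% z %*% out$v)
all.equal(as.numeric(out$d), as.numeric(d.byhand))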
Value
u u is output. If you asked for multiple factors then each column of u is a factor. u
has dimension nxK if you asked for K factors.
v v is output. If you asked for multiple factors then each column of v is a factor. v
has dimension pxK if you asked for K factors.
d A vector of length K, which can alternatively be computed as the diagonal of
the matrix $u’X’Zv$.
v.init The first K factors of the v matrix of the SVD of x’z. This is saved in case this
function will be re-run later.
Author(s)
Daniela M. Witten and Robert Tibshirani
References
Witten DM, Tibshirani R and Hastie T (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>
See Also
PMD, CCA.permute
Examples
# first, do CCA with type="standard"
# A simple simulated example
u <- matrix(c(rep(1,25),rep(0,75)),ncol=1)
v1 <- matrix(c(rep(1,50),rep(0,450)),ncol=1)
v2 <- matrix(c(rep(0,50),rep(1,50),rep(0,900)),ncol=1)
x <- u%*%t(v1) + matrix(rnorm(100*500),ncol=500)
z <- u%*%t(v2) + matrix(rnorm(100*1000),ncol=1000)
# Can run CCA with default settings, and can get e.g. 3 components
out <- CCA(x,z,typex="standard",typez="standard",K=3)
print(out,verbose=TRUE) # To get less output, just print(out)
# Or can use CCA.permute to choose optimal parameter values
perm.out <- CCA.permute(x,z,typex="standard",typez="standard",nperms=7)
print(perm.out)
plot(perm.out)
out <- CCA(x,z,typex="standard",typez="standard",K=1,
penaltyx=perm.out$bestpenaltyx,penaltyz=perm.out$bestpenaltyz,
v=perm.out$v.init)
print(out)
# Can also incorporate a phenotype into the analysis: find features in
# x and z that are correlated with each other and with the outcome:
y <- rnorm(nrow(x))
perm.out <- CCA.permute(x,z,typex="standard",typez="standard",
    outcome="quantitative",y=y, nperms=6)
print(perm.out)
out <- CCA(x,z,typex="standard",typez="standard",outcome="quantitative",
    y=y,penaltyx=perm.out$bestpenaltyx,penaltyz=perm.out$bestpenaltyz)
print(out)
CCA.permute   Select tuning parameters for sparse canonical correlation analysis using the penalized matrix decomposition.
Description
This function can be used to automatically select tuning parameters for sparse CCA using the penalized matrix decomposition. For each data set x and z, two types are possible: (1) type "standard", which does not assume any ordering of the columns of the data set, and (2) type "ordered", which assumes that the columns of the data set are ordered, so that the corresponding canonical vector should be both sparse and smooth (e.g. CGH data).
For X and Z, the samples are on the rows and the features are on the columns.
The tuning parameters are selected using a permutation scheme. For each candidate tuning parameter value, the following is performed: (1) The samples in X are randomly permuted nperms times, to obtain matrices $X^*_1, X^*_2, \ldots$. (2) Sparse CCA is run on each permuted data set $(X^*_i, Z)$ to obtain factors $(u^*_i, v^*_i)$. (3) Sparse CCA is run on the original data (X, Z) to obtain factors u and v. (4) Compute $c^*_i = cor(X^*_i u^*_i, Z v^*_i)$ and $c = cor(Xu, Zv)$. (5) Use Fisher's transformation to convert these correlations into random variables that are approximately normally distributed. Let Fisher(c) denote the Fisher transformation of c. (6) Compute a z-statistic for Fisher(c), using $(Fisher(c) - mean(Fisher(c^*)))/sd(Fisher(c^*))$. The larger the z-statistic, the "better" the corresponding tuning parameter value.
This function also gives the p-value for each pair of canonical variates (u, v) resulting from a given tuning parameter value. This p-value is computed as the fraction of $c^*_i$'s that exceed c (using the notation of the previous paragraph).
Using this function, only the first left and right canonical variates are considered in selection of the
tuning parameter.
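To make the scheme concrete, the z-statistic and p-value can be recomputed by hand from the returned cors and corperms (Fisher's transformation is atanh). This is a minimal sketch on simulated data; it treats corperms as an nperms-by-(number of candidate values) matrix, as the Value section suggests, and the hand-computed quantities should track the returned zstat and pvals.
set.seed(1)
x <- matrix(rnorm(40*60), ncol=60)
z <- matrix(rnorm(40*80), ncol=80)
perm.out <- CCA.permute(x, z, typex="standard", typez="standard", nperms=10, trace=FALSE)
i <- which.max(perm.out$zstat)                    # one candidate tuning parameter value
ft <- atanh(perm.out$cors[i])                     # Fisher transformation of cor(Xu, Zv)
ft.perm <- atanh(perm.out$corperms[,i])           # Fisher transformation of permuted correlations
(ft - mean(ft.perm)) / sd(ft.perm)                # hand-computed z-statistic
mean(perm.out$corperms[,i] >= perm.out$cors[i])   # hand-computed p-value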
Usage
CCA.permute(x,z,typex=c("standard", "ordered"),typez=c("standard","ordered"),
penaltyxs=NULL, penaltyzs=NULL,
niter=3,v=NULL,trace=TRUE,nperms=25, standardize=TRUE, chromx=NULL,
chromz=NULL,upos=FALSE, uneg=FALSE, vpos=FALSE, vneg=FALSE,outcome=NULL,
y=NULL, cens=NULL)
## S3 method for class 'CCA.permute'
plot(x,...)
## S3 method for class 'CCA.permute'
print(x,...)
Arguments
x Data matrix; samples are rows and columns are features.
z Data matrix; samples are rows and columns are features. Note that x and z
must have the same number of rows, but may (and generally will) have different
numbers of columns.
typex Are the columns of x unordered (type="standard") or ordered (type="ordered")? If "standard", then a lasso penalty is applied to u, to enforce sparsity. If "ordered" (generally used for CGH data), then a fused lasso penalty is applied, to enforce both sparsity and smoothness.
typez Are the columns of z unordered (type="standard") or ordered (type="ordered")?
If "standard", then a lasso penalty is applied to v, to enforce sparsity. If "ordered"
(generally used for CGH data), then a fused lasso penalty is applied, to enforce
both sparsity and smoothness.
penaltyxs The set of x penalties to be considered. If typex="standard", then the L1 bound
on u is penaltyxs*sqrt(ncol(x)). If "ordered", then it’s the lambda for the fused
lasso penalty. The user can specify a single value or a vector of values. If
penaltyxs is a vector and penaltyzs is a vector, then the vectors must have the
same length. If NULL, then the software will automatically choose a single
lambda value if type is "ordered", or a grid of (L1 bounds)/sqrt(ncol(x)) if type
is "standard".
Details
Note that x and z must have the same number of rows. This function performs just a one-dimensional search in tuning parameter space, even if penaltyxs and penaltyzs are both vectors: only the pairs (penaltyxs[1], penaltyzs[1]), (penaltyxs[2], penaltyzs[2]), ... are considered.
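For example, the hedged sketch below evaluates seven matched (penaltyx, penaltyz) pairs, not a 7x7 grid; the data and candidate values are illustrative.
set.seed(1)
x <- matrix(rnorm(40*60), ncol=60); z <- matrix(rnorm(40*80), ncol=80)
penaltyxs <- seq(0.1, 0.7, len=7)
penaltyzs <- seq(0.2, 0.8, len=7)
# Pair i is (penaltyxs[i], penaltyzs[i]); the two grids are never crossed.
perm.out <- CCA.permute(x, z, typex="standard", typez="standard",
                        penaltyxs=penaltyxs, penaltyzs=penaltyzs, nperms=10)
print(perm.out)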
Value
zstat The vector of z-statistics, one per candidate tuning parameter value.
pvals The vector of p-values, one per candidate tuning parameter value.
bestpenaltyx The x penalty that resulted in the highest z-statistic.
bestpenaltyz The z penalty that resulted in the highest z-statistic.
cors The value of cor(Xu,Zv) obtained for each candidate tuning parameter value.
corperms The nperms values of cor(X*u*,Zv*) obtained for each candidate tuning parameter value, where X* indicates the X matrix with permuted rows, and u* and v* are the output of CCA using data (X*,Z).
ft.cors The result of applying the Fisher transformation to cors.
ft.corperms The result of applying the Fisher transformation to corperms.
nnonzerous Number of non-zero elements of u resulting from applying CCA to data (X,Z) for each candidate tuning parameter value.
nnonzerovs Number of non-zero elements of v resulting from applying CCA to data (X,Z) for each candidate tuning parameter value.
v.init The first factor of the v matrix of the SVD of x’z. This is saved in case this
function (or the CCA function) will be re-run later.
Author(s)
Daniela M. Witten, Robert Tibshirani
References
Witten DM, Tibshirani R and Hastie T (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>
See Also
PMD, CCA
Examples
# See examples in CCA function
MultiCCA   Perform sparse multiple canonical correlation analysis.
Description
Given matrices $X1,...,XK$, which represent K sets of features on the same set of samples, find sparse $w1,...,wK$ such that $\sum_{i<j} wi' Xi' Xj wj$ is large. If the columns of Xk are ordered (and type="ordered") then wk will also be smooth. For $X1,...,XK$, the samples are on the rows and the features are on the columns. $X1,...,XK$ must have the same number of rows, but may (and usually will) have different numbers of columns.
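As a concrete check of the criterion, the sketch below evaluates $\sum_{i<j} wi' Xi' Xj wj$ at the returned weights; the data and penalty are illustrative choices, and standardize=FALSE is used so the objective can be computed on the matrices as supplied.
set.seed(1)
xlist <- list(matrix(rnorm(30*40), ncol=40), matrix(rnorm(30*50), ncol=50),
              matrix(rnorm(30*60), ncol=60))
out <- MultiCCA(xlist, type="standard", penalty=1.5, standardize=FALSE, trace=FALSE)
# Evaluate the criterion at the sparse weight vectors (first component):
K <- length(xlist); obj <- 0
for (i in 1:(K-1)) for (j in (i+1):K)
  obj <- obj + as.numeric(t(out$ws[[i]][,1]) %*% t(xlist[[i]]) %*% xlist[[j]] %*% out$ws[[j]][,1])
obj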
Usage
MultiCCA(xlist, penalty=NULL, ws=NULL,
niter=25, type="standard", ncomponents=1, trace=TRUE, standardize=TRUE)
## S3 method for class 'MultiCCA'
print(x,...)
Arguments
xlist A list of length K, where K is the number of data sets on which to perform
sparse multiple CCA. Data set k should be a matrix of dimension $n x p_k$
where $p_k$ is the number of features in data set k.
penalty The penalty terms to be used. Can be a single value (if the same penalty term is
to be applied to each data set) or a K-vector, indicating a different penalty term
for each data set. There are 2 possible interpretations for the penalty terms: If
type="standard" then this is an L1 bound on wk, and it must be between 1 and
$sqrt(p_k)$ ($p_k$ is the number of features in matrix Xk). If type="ordered"
then this is the parameter for the fused lasso penalty on wk.
type Are the columns of $x1,...,xK$ unordered (type="standard") or ordered (type="ordered")? If "standard", then a lasso penalty is applied to wk, to enforce sparsity. If "ordered" (generally used for CGH data), then a fused lasso penalty is applied, to enforce both sparsity and smoothness. This argument can be a vector of length K (if different data sets are of different types) or a single value "ordered"/"standard" (if all data sets are of the same type).
ncomponents How many factors do you want? Default is 1.
niter How many iterations should be performed? Default is 25.
ws A list of length K. The kth element contains the first ncomponents columns of
the v matrix of the SVD of Xk. If NULL, then the SVD of $X1,...,XK$ will
be computed inside the MultiCCA function. However, if you plan to run this
function multiple times, then save a copy of this argument so that it does not
need to be re-computed.
trace Print out progress?
standardize Should the columns of $X1,...,XK$ be centered (to have mean zero) and scaled
(to have standard deviation 1)? Default is TRUE.
x not used.
... not used.
Value
ws A list of length K, containing the sparse canonical variates found (element k is a $p_k x ncomponents$ matrix).
ws.init A list of length K containing the initial values of ws used; by default these are the v vectors of the SVD of each Xk.
Author(s)
Daniela M. Witten and Robert Tibshirani
References
Witten DM, Tibshirani R and Hastie T (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>
See Also
MultiCCA.permute, CCA, CCA.permute
Examples
# Generate 3 data sets so that first 25 features are correlated across
# the data sets...
u <- matrix(rnorm(50),ncol=1)
v1 <- matrix(c(rep(.5,25),rep(0,75)),ncol=1)
v2 <- matrix(c(rep(1,25),rep(0,25)),ncol=1)
v3 <- matrix(c(rep(.5,25),rep(0,175)),ncol=1)
x1 <- u%*%t(v1) + matrix(rnorm(50*100),ncol=100)
x2 <- u%*%t(v2) + matrix(rnorm(50*50),ncol=50)
x3 <- u%*%t(v3) + matrix(rnorm(50*200),ncol=200)
xlist <- list(x1, x2, x3)
# Select tuning parameters by permutation; here x2 is treated as "ordered":
perm.out <- MultiCCA.permute(xlist, type=c("standard","ordered","standard"))
print(perm.out)
plot(perm.out)
out <- MultiCCA(xlist, type=c("standard","ordered","standard"),
  penalty=perm.out$bestpenalties, ncomponents=2, ws=perm.out$ws.init)
print(out)
# Or if you want to specify tuning parameters by hand:
# this time, assume all data sets are standard:
perm.out <- MultiCCA.permute(xlist, type="standard",
  penalties=cbind(c(1.1,1.1,1.1),c(2,3,4),c(5,7,10)), ws=perm.out$ws.init)
print(perm.out)
plot(perm.out)
MultiCCA.permute   Select tuning parameters for sparse multiple canonical correlation analysis using the penalized matrix decomposition.
Description
This function can be used to automatically select tuning parameters for sparse multiple CCA. This
is the analog of sparse CCA, when >2 data sets are available. Each data set may have features
of type="standard" or type="ordered" (e.g. CGH data). Assume that there are K data sets, called
$X1,...,XK$.
The tuning parameters are selected using a permutation scheme. For each candidate tuning parameter value, the following is performed: (1) Repeat the following n times, for n large: (a) The samples in $(X1,...,XK)$ are randomly permuted to obtain data sets $(X1*,...,XK*)$. (b) Sparse multiple CCA is run on the permuted data sets $(X1*,...,XK*)$ to get canonical variates $(w1*,...,wK*)$. (c) Record $t^* = \sum_{i<j} cor(Xi* wi*, Xj* wj*)$. (2) Sparse multiple CCA is run on the original data $(X1,...,XK)$ to obtain canonical variates $(w1,...,wK)$. (3) Record $t = \sum_{i<j} cor(Xi wi, Xj wj)$. (4) The resulting p-value is given by $mean(t^* > t)$; that is, the fraction of permuted totals that exceed the total on the real data. Then, choose the tuning parameter value that gives the smallest p-value in Step 4.
This function only selects tuning parameters for the FIRST sparse multiple CCA factor.
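The statistic in steps (1)-(4) can also be computed by hand, as in the hedged sketch below (illustrative data, a single candidate penalty, a small number of permutations, and standardize=FALSE so the correlations are taken on the matrices as supplied; each data set's rows are permuted independently).
set.seed(1)
xlist <- list(matrix(rnorm(30*40), ncol=40), matrix(rnorm(30*50), ncol=50),
              matrix(rnorm(30*60), ncol=60))
sumcor <- function(xl, ws) {          # t = sum_{i<j} cor(Xi wi, Xj wj), first component
  K <- length(xl); tot <- 0
  for (i in 1:(K-1)) for (j in (i+1):K)
    tot <- tot + as.numeric(cor(xl[[i]] %*% ws[[i]][,1], xl[[j]] %*% ws[[j]][,1]))
  tot
}
t.obs <- sumcor(xlist, MultiCCA(xlist, type="standard", penalty=2,
                                standardize=FALSE, trace=FALSE)$ws)
t.perm <- replicate(10, {             # small number of permutations, for illustration only
  xp <- lapply(xlist, function(xk) xk[sample(nrow(xk)), ])
  sumcor(xp, MultiCCA(xp, type="standard", penalty=2, standardize=FALSE, trace=FALSE)$ws)
})
mean(t.perm > t.obs)                  # permutation p-value for penalty = 2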
Usage
Arguments
xlist A list of length K, where K is the number of data sets on which to perform
sparse multiple CCA. Data set k should be a matrix of dimension $n x p_k$
where $p_k$ is the number of features in data set k.
penalties The penalty terms to be considered in the cross-validation. If the same penalty
term is desired for each data set, then this should be a vector of length equal
to the number of penalty terms to be considered. If different penalty terms are
desired for each data set, then this should be a matrix with rows equal to the
number of data sets, and columns equal to the number of penalty terms to be
considered. For a given data set Xk, if type is "standard" then the penalty term
should be a number between 1 and $sqrt(p_k)$ (the number of features in data
set k); it is a L1 bound on wk. If type is "ordered", on the other hand, the penalty
term is of the form lambda in the fused lasso penalty. Therefore, the interpretation of the argument depends on whether type is "ordered" or "standard" for this
data set.
type A K-vector containing elements "standard" or "ordered" - or a single value. If a single value, then it is assumed that all elements are the same (either "standard" or "ordered"). If the columns of a data set are ordered (e.g. CGH spots ordered along the chromosome) then use "ordered"; otherwise use "standard". "standard" will result in a lasso ($L_1$) penalty on wk, which yields sparsity. "ordered" will result in a fused lasso penalty on wk, yielding both sparsity and smoothness.
niter How many iterations should be performed each time CCA is called? Default
is 3, since an approximate estimate of u and v is acceptable in this case, and
otherwise this function can be quite time-consuming.
ws A list of length K; the kth element contains the first ncomponents columns of the v matrix of the SVD of Xk. If NULL, then the SVD of Xk will be computed inside this function. However, if you plan to run this function multiple times, then save a copy of this argument so that it does not need to be re-computed.
trace Print out progress?
nperms How many times should the data be permuted? Default is 25. A large value of
nperms is very important here, since the formula for computing the z-statistics
requires a standard deviation estimate for the correlations obtained via permutation, which will not be accurate if nperms is very small.
standardize Should the columns of $X1,...,XK$ be centered (to have mean zero) and scaled (to have standard deviation 1)? Default is TRUE.
x not used.
... not used.
Details
Note that $x1,...,xK$ must have the same number of rows. This function performs just a one-dimensional search in tuning parameter space.
Value
zstat The vector of z-statistics, one per element of penalties.
Author(s)
Daniela M. Witten, Robert Tibshirani
References
Witten DM, Tibshirani R and Hastie T (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>
See Also
MultiCCA, CCA.permute, CCA
Examples
# See examples in MultiCCA function
PlotCGH   Plot CGH (DNA copy number) data.
Description
Given a vector of gains/losses at CGH spots, this makes a plot of gain/loss on each chromosome.
Usage
PlotCGH(array,chrom=NULL,nuc=NULL,main="",scaleEachChrom=TRUE)
Arguments
array A numeric vector of gains/losses at each CGH spot; this is the quantity that gets plotted.
chrom A numeric vector of the same length as "array"; its values should indicate the
chromosome that each CGH spot is on (for instance, for human genomic data,
values of chrom should range from 1 to 24). If NULL, then it is assumed that
all elements of ’array’ are on the same chromosome.
nuc A numeric vector of same length as "array", indicating the nucleotide position
of each CGH spot. If NULL, then the function assumes that each CGH spot
corresponds to a consecutive position. E.g. if there are 200 CGH spots on
chromosome 1, then they are located at positions 1,2,...,199,200.
main Give your plot a title.
scaleEachChrom Default is TRUE. This means that each chromosome's CGH spots are divided by 1.1 times the maximum of the CGH spots on that chromosome. This way, the CGH spots on each chromosome of the plot are as large as possible (i.e. easy to see). If FALSE, then all of the CGH spots are divided by 1.1 times the maximum of ALL the CGH spots. This means that on some chromosomes the CGH spots might be hard to see, but it has the advantage that the relative magnitudes of CGH spots on different chromosomes can be seen from the figure.
Author(s)
Daniela M. Witten (adapted from Pei Wang and Rob Tibshirani’s cghFLasso package)
References
Witten DM, Tibshirani R and Hastie T (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>
Examples
# Use breast data
data(breastdata)
attach(breastdata)
# dna contains CGH data and chrom contains chromosome of each CGH spot;
# nuc contains position of each CGH spot.
dna <- t(dna)
PlotCGH(dna[1,],chrom=chrom,nuc=nuc,main="Sample 1: All Chromosomes")
PlotCGH(dna[1,chrom==1], chrom=chrom[chrom==1], nuc=nuc[chrom==1],
main= "Sample 1: Chrom 1")
PlotCGH(dna[1,chrom<=3], chrom=chrom[chrom<=3], nuc=nuc[chrom<=3],
main="Sample 1: Chroms 1, 2, and 3")
detach(breastdata)
PMD   Perform a penalized matrix decomposition for a data matrix.
Description
Performs a penalized matrix decomposition for a data matrix. Finds factors u and v that summarize
the data matrix well. u and v will both be sparse, and v can optionally also be smooth.
Usage
Arguments
x Data matrix of dimension $n x p$, which can contain NA for missing values.
type "standard" or "ordered": Do we want v to simply be sparse, or should it also be
smooth? If the columns of x are ordered (e.g. CGH spots along a chromosome)
then choose "ordered". Default is "standard". If "standard", then the PMD func-
tion will make use of sumabs OR sumabsu&sumabsv. If "ordered", then the
function will make use of sumabsu and lambda.
sumabs Used only if type is "standard". A measure of sparsity for the u and v vectors, between 0 and 1. When sumabs is specified, and sumabsu and sumabsv are NULL, then sumabsu is set to $sqrt(n)*sumabs$ and sumabsv is set to $sqrt(p)*sumabs$. If sumabs is specified, then sumabsu and sumabsv should be NULL. Or if sumabsu and sumabsv are specified, then sumabs should be NULL.
sumabsu Used for types "ordered" AND "standard". How sparse do you want u to be?
This is the sum of absolute values of elements of u. It must be between 1 and
the square root of the number of rows in data matrix. The smaller it is, the
sparser u will be.
sumabsv Used only if type is "standard". How sparse do you want v to be? This is the
sum of absolute values of elements of v. It must be between 1 and square root
of number of columns of data. The smaller it is, the sparser v will be.
lambda Used only if type is "ordered". This is the tuning parameter for the fused lasso penalty on v, which takes the form $\lambda \sum_j |v_j| + \lambda \sum_j |v_j - v_{j-1}|$. $\lambda$ must be non-negative. If NULL, then it is chosen adaptively from the data.
niter How many iterations should be performed. It is best to run at least 20 or so. Default is 20.
K The number of factors in the PMD to be returned; default is 1.
v The first right singular vector(s) of the data. (If missing data is present, then the
missing values are imputed before the singular vectors are calculated.) v is used
as the initial value for the iterative PMD algorithm. If x is large, then this step
can be time-consuming; therefore, if PMD is to be run multiple times, then v
should be computed once and saved.
trace Print out progress as iterations are performed? Default is TRUE.
center Subtract out mean of x? Default is TRUE.
chrom If type is "ordered", then this gives the option to specify that some columns of
x (corresponding to CGH spots) are on different chromosomes. Then v will
be sparse, and smooth *within* each chromosome but not *between* chromosomes. Length of chrom should equal the number of columns of x, and each entry in
chrom should be a number corresponding to which chromosome the CGH spot
is on.
rnames An optional vector containing a name for each row of x.
cnames An optional vector containing a name for each column of x.
upos Constrain the elements of u to be positive? TRUE or FALSE.
uneg Constrain the elements of u to be negative? TRUE or FALSE.
vpos Constrain the elements of v to be positive? TRUE or FALSE. Cannot be used if
type is "ordered".
vneg Constrain the elements of v to be negative? TRUE or FALSE. Cannot be used if
type is "ordered."
Details
The criterion for the PMD is as follows: we seek vectors $u$ and $v$ such that $u’Xv$ is large,
subject to $||u||_2=1, ||v||_2=1$ and additional penalties on $u$ and $v$. These additional penalties
are as follows: If type is "standard", then lasso ($L_1$) penalties (promoting sparsity) are placed on
u and v. If type is "ordered", then lasso penalty is placed on u and a fused lasso penalty (promoting
sparsity and smoothness) is placed on v.
If type is "standard", then arguments sumabs OR sumabsu&sumabsv are used. If type is "ordered",
then sumabsu AND lambda are used. Sumabsu is the bound of absolute value of elements of u.
Sumabsv is bound of absolute value of elements of v. If sumabs is given, then sumabsu is set to
sqrt(nrow(x))*sumabs and sumabsv is set to sqrt(ncol(x))*sumabs. $lambda$ is the parameter for
the fused lasso penalty on v when type is "ordered": $lambda(||v||_1 + sum_j |v_j - v_(j-1))$.
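For the "standard" case, here is a minimal sketch on simulated data (the matrix, the sparsity bounds and K are illustrative choices; center=FALSE keeps the rank-1 residual formula below exact):
set.seed(1)
u0 <- matrix(c(rnorm(20), rep(0,80)), ncol=1)       # sparse left factor
v0 <- matrix(c(rnorm(30), rep(0,120)), ncol=1)      # sparse right factor
x <- u0 %*% t(v0) + matrix(rnorm(100*150), ncol=150)
out <- PMD(x, type="standard", sumabs=NULL, sumabsu=4, sumabsv=4, K=1, center=FALSE)
print(out)
# Rank-1 residual X - d u v', as described under Value:
resid <- x - out$d[1] * out$u[,1] %*% t(out$v[,1])
sum(resid^2) < sum(x^2)   # the factorization captures part of the signal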
Value
u u is output. If you asked for multiple factors then each column of u is a factor. u
has dimension nxK if you asked for K factors.
v v is output. If you asked for multiple factors then each column of v is a factor. v
has dimension pxK if you asked for K factors.
d d is output. Computationally, $d=u'Xv$ where $u$ and $v$ are the sparse factors output by the PMD function and $X$ is the data matrix input to the PMD function. When K=1, the residuals of the rank-1 PMD are given by $X - duv'$.
v.init The first right singular vector(s) of the data; these are returned to save on computation time if PMD will be run again.
meanx Mean of x that was subtracted out before PMD was performed.
Author(s)
Daniela M. Witten and Robert Tibshirani
References
Witten DM, Tibshirani R and Hastie T (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>
See Also
PMD.cv, SPC
Examples
# PMD with an L1 penalty on rows and a fused lasso penalty on
# columns: type="ordered". We'll use the Chin et al (2006) Cancer Cell
# data set; try "?breastdata" for more info.
data(breastdata)
attach(breastdata)
# dna contains CGH data and chrom contains chromosome of each CGH spot;
# nuc contains position of each CGH spot.
dna <- t(dna) # Need samples on rows and CGH spots on columns
# First, look for shared regions of gain/loss on chromosome 1.
# Use cross-validation to choose tuning parameter value
par(mar=c(2,2,2,2))
cv.out <- PMD.cv(dna[,chrom==1],type="ordered",chrom=chrom[chrom==1],
nuc=nuc[chrom==1],
sumabsus=seq(1, sqrt(nrow(dna)), len=15))
print(cv.out)
plot(cv.out)
out <- PMD(dna[,chrom==1],type="ordered",
sumabsu=cv.out$bestsumabsu,chrom=chrom[chrom==1],K=1,v=cv.out$v.init,
cnames=paste("Pos",sep="",
nuc[chrom==1]), rnames=paste("Sample", sep=" ", 1:nrow(dna)))
print(out, verbose=TRUE)
# Which samples actually have that region of gain/loss?
par(mfrow=c(3,1))
par(mar=c(2,2,2,2))
PlotCGH(dna[which.min(out$u[,1]),chrom==1],chrom=chrom[chrom==1],
main=paste(paste(paste("Sample ", sep="", which.min(out$u[,1])),
sep="; u=", round(min(out$u[,1]),3))),nuc=nuc[chrom==1])
PlotCGH(dna[88,chrom==1], chrom=chrom[chrom==1],
main=paste("Sample 88; u=", sep="", round(out$u[88,1],3)),
nuc=nuc[chrom==1])
PlotCGH(out$v[,1],chrom=chrom[chrom==1], main="V",nuc=nuc[chrom==1])
detach(breastdata)
PMD.cv   Select tuning parameters for rank-1 PMD via cross-validation.
Description
Performs cross-validation to select tuning parameters for rank-1 PMD, the penalized matrix decomposition for a data matrix.
Usage
PMD.cv(x, type=c("standard", "ordered"), sumabss=seq(0.1,0.7,len=10),
sumabsus=NULL, lambda=NULL, nfolds=5, niter=5, v=NULL, chrom=NULL, nuc=NULL,
trace=TRUE, center=TRUE, upos=FALSE, uneg=FALSE, vpos=FALSE, vneg=FALSE)
Arguments
x Data matrix of dimension $n x p$, which can contain NA for missing values.
type "standard" or "ordered": Do we want v to simply be sparse, or should it also be
smooth? If the columns of x are ordered (e.g. CGH spots along a chromosome)
then choose "ordered". Default is "standard". If "standard", then the PMD func-
tion will make use of sumabs OR sumabsu&sumabsv. If "ordered", then the
function will make use of sumabsu and lambda.
sumabss Used only if type is "standard". A vector of sumabs values to be used. Sumabs
is a measure of sparsity for u and v vectors, between 0 and 1. When sumabss
is specified, and sumabsus and sumabsvs are NULL, then sumabsus is set to
$sqrt(n)*sumabss$ and sumabsvs is set at $sqrt(p)*sumabss$. If sumabss is
specified, then sumabsus and sumabsvs should be NULL. Or if sumabsus and
sumabsvs are specified, then sumabss should be NULL.
sumabsus Used only for type "ordered". A vector of sumabsu values to be used. Sumabsu
measures sparseness of u - it is the sum of absolute values of elements of u.
Must be between 1 and sqrt(n).
lambda Used only if type is "ordered". This is the tuning parameter for the fused lasso penalty on v, which takes the form $\lambda \sum_j |v_j| + \lambda \sum_j |v_j - v_{j-1}|$. $\lambda$ must be non-negative. If NULL, then it is chosen adaptively from the data.
nfolds How many cross-validation folds should be performed? Default is 5.
niter How many iterations should be performed. For speed, only 5 are performed by
default.
v The first right singular vector(s) of the data. (If missing data is present, then the
missing values are imputed before the singular vectors are calculated.) v is used
as the initial value for the iterative PMD algorithm. If x is large, then this step
can be time-consuming; therefore, if PMD is to be run multiple times, then v
should be computed once and saved.
chrom If type is "ordered", then this gives the option to specify that some columns of
x (corresponding to CGH spots) are on different chromosomes. Then v will
be sparse, and smooth *within* each chromosome but not *between* chromosomes. Length of chrom should equal the number of columns of x, and each entry in
chrom should be a number corresponding to which chromosome the CGH spot
is on.
nuc If type is "ordered", can specify the nucleotide position of each CGH spot (col-
umn of x), to be used in plotting. If NULL, then it is assumed that CGH spots
are equally spaced.
trace Print out progress as iterations are performed? Default is TRUE.
center Subtract out mean of x? Default is TRUE
Details
If type is "standard", then lasso ($L_1$) penalties (promoting sparsity) are placed on u and v. If
type is "ordered", then lasso penalty is placed on u and a fused lasso penalty (promoting sparsity
and smoothness) is placed on v.
Cross-validation of the rank-1 PMD is performed over sumabss (if type is "standard") or over sumabsus (if type is "ordered"). If type is "ordered", then lambda is chosen from the data without cross-validation.
The cross-validation works as follows: Some percent of the elements of $x$ is removed at random from the data matrix. The PMD is performed for a range of tuning parameter values on this partially-missing data matrix; then, missing values are imputed using the decomposition obtained. The value of the tuning parameter that results in the lowest sum of squared errors of the missing values is "best".
To do cross-validation on the rank-2 PMD, first the rank-1 PMD should be computed, and then this
function should be performed on the residuals, given by $x-udv’$.
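A minimal sketch of that two-step recipe on simulated data (illustrative tuning grid; center=FALSE so the rank-1 residual formula applies exactly):
set.seed(1)
x <- matrix(rnorm(100*150), ncol=150)
# Cross-validate and fit the rank-1 PMD:
cv1 <- PMD.cv(x, type="standard", sumabss=seq(0.1, 0.7, len=10), center=FALSE)
fit1 <- PMD(x, type="standard", sumabs=cv1$bestsumabs, sumabsu=NULL, sumabsv=NULL,
            K=1, center=FALSE)
# Cross-validate the second factor on the residuals x - udv':
x2 <- x - fit1$d[1] * fit1$u[,1] %*% t(fit1$v[,1])
cv2 <- PMD.cv(x2, type="standard", sumabss=seq(0.1, 0.7, len=10), center=FALSE)
print(cv2$bestsumabs)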
Value
cv Average sum of squared errors obtained over cross-validation folds.
cv.error Standard error of average sum of squared errors obtained over cross-validation
folds.
bestsumabs If type="standard", then value of sumabss resulting in smallest CV error is re-
turned.
bestsumabsu If type="ordered", then value of sumabsus resulting in smallest CV error is re-
turned.
v.init The first right singular vector(s) of the data; these are returned to save on computation time if PMD will be run again.
Author(s)
Daniela M. Witten and Robert Tibshirani
References
Witten DM, Tibshirani R and Hastie T (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>
See Also
PMD, SPC
Examples
# See the examples in the PMD help file, which use PMD.cv for tuning parameter selection.
SPC   Perform sparse principal components analysis.
Description
Performs sparse principal components analysis by applying PMD to a data matrix with lasso ($L_1$)
penalty on the columns and no penalty on the rows.
Usage
SPC(x, sumabsv=4, niter=20, K=1, orth=FALSE, trace=TRUE, v=NULL,
center=TRUE, cnames=NULL, vpos=FALSE, vneg=FALSE, compute.pve=TRUE)
## S3 method for class 'SPC'
print(x,verbose=FALSE,...)
Arguments
x Data matrix of dimension $n x p$, which can contain NA for missing values.
We are interested in finding sparse principal components of dimension $p$.
sumabsv How sparse do you want v to be? This is the sum of absolute values of elements
of v. It must be between 1 and square root of number of columns of data. The
smaller it is, the sparser v will be.
niter How many iterations should be performed. It is best to run at least 20 or so. Default is 20.
K The number of factors in the PMD to be returned; default is 1.
v The first right singular vector(s) of the data. (If missing data is present, then the
missing values are imputed before the singular vectors are calculated.) v is used
as the initial value for the iterative PMD($L_1$, $L_1$) algorithm. If x is large,
then this step can be time-consuming; therefore, if PMD is to be run multiple
times, then v should be computed once and saved.
trace Print out progress as iterations are performed? Default is TRUE.
orth If TRUE, then use method of Section 3.2 of Witten, Tibshirani and Hastie (2008)
to obtain multiple sparse principal components. Default is FALSE.
center Subtract out mean of x? Default is TRUE
cnames An optional vector containing a name for each column.
Details
PMD(x, sumabsu=sqrt(nrow(x)), sumabsv=3, K=1) and SPC(x, sumabsv=3, K=1) give the same result, since the SPC method is simply PMD with an L1 penalty on the columns and no penalty on the rows.
In Witten, Tibshirani, and Hastie (2008), two methods are presented for obtaining multiple factors for SPC. The methods are as follows:
(1) If one has already obtained $k-1$ factors, then one can compute residuals by subtracting out these factors. Then $u_k$ and $v_k$ can be obtained by applying the SPC/PMD algorithm to the residuals.
(2) One can require that $u_k$ be orthogonal to the $u_i$'s with $i<k$; the method is slightly more complicated, and is explained in WT&H(2008).
Method 1 is performed by running SPC with option orth=FALSE (the default) and Method 2 is performed using option orth=TRUE. Note that Methods 1 and 2 always give identical results for the first component, and often give quite similar results for later components.
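A minimal sketch comparing the two approaches on simulated data (the matrix, sumabsv and K are illustrative choices):
set.seed(2)
x <- matrix(rnorm(60*80), ncol=80)
out1 <- SPC(x, sumabsv=4, K=3, orth=FALSE)   # Method 1: successive residuals
out2 <- SPC(x, sumabsv=4, K=3, orth=TRUE)    # Method 2: orthogonal u's
cor(out1$v[,1], out2$v[,1])                  # first components agree (up to sign)
round(crossprod(out2$u), 3)                  # u's from orth=TRUE are (nearly) orthogonal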
Value
u u is output. If you asked for multiple factors then each column of u is a factor. u
has dimension nxK if you asked for K factors.
v v is output. These are the sparse principal components. If you asked for multiple
factors then each column of v is a factor. v has dimension pxK if you asked for
K factors.
d d is output; it is the diagonal of the matrix $D$ in the penalized matrix decomposition. In the case of the rank-1 decomposition, it is given in the formulation $||X-duv'||_F^2$ subject to $||u||_1 <= sumabsu$, $||v||_1 <= sumabsv$. Computationally, $d=u'Xv$ where $u$ and $v$ are the sparse factors output by the PMD function and $X$ is the data matrix input to the PMD function.
prop.var.explained
A vector containing the proportion of variance explained by the first 1, 2, ..., K sparse principal components obtained. The formula for the proportion of variance explained is on page 20 of Shen & Huang (2008), Journal of Multivariate Analysis 99: 1015-1034.
v.init The first right singular vector(s) of the data; these are returned to save on computation time if PMD will be run again.
meanx Mean of x that was subtracted out before SPC was performed.
Author(s)
Daniela M. Witten and Robert Tibshirani
References
Witten DM, Tibshirani R and Hastie T (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>
See Also
SPC.cv, PMD, PMD.cv
Examples
# See the examples in the SPC.cv help file, which demonstrate SPC as well.
SPC.cv   Select the tuning parameter for sparse principal components analysis via cross-validation.
Description
Selects tuning parameter for the sparse principal component analysis method of Witten, Tibshirani,
and Hastie (2008), which involves applying PMD to a data matrix with lasso ($L_1$) penalty on
the columns and no penalty on the rows. The tuning parameter controls the sum of absolute values
- or $L_1$ norm - of the elements of the sparse principal component.
Usage
Arguments
x Data matrix of dimension $n x p$, which can contain NA for missing values.
We are interested in finding sparse principal components of dimension $p$.
sumabsvs Range of sumabsv values to be considered in cross-validation. Sumabsv is the
sum of absolute values of elements of v. It must be between 1 and square root
of number of columns of data. The smaller it is, the sparser v will be.
nfolds Number of cross-validation folds performed.
niter How many iterations should be performed. By default, perform only 5 for speed
reasons.
v The first right singular vector(s) of the data. (If missing data is present, then the
missing values are imputed before the singular vectors are calculated.) v is used
as the initial value for the iterative PMD($L_1$, $L_1$) algorithm. If x is large,
then this step can be time-consuming; therefore, if PMD is to be run multiple
times, then v should be computed once and saved.
trace Print out progress as iterations are performed? Default is TRUE.
orth If TRUE, then use method of Section 3.2 of Witten, Tibshirani and Hastie (2008)
to obtain multiple sparse principal components. Default is FALSE.
center Subtract out mean of x? Default is TRUE
vpos Constrain elements of v to be positive? Default is FALSE.
vneg Constrain elements of v to be negative? Default is FALSE.
... not used.
Details
This method only performs cross-validation for the first sparse principal component. It does so by performing the following steps nfolds times: (1) replace a fraction of the data with missing values, (2) perform SPC on this new data matrix using a range of tuning parameter values, each time getting a rank-1 approximation $udv'$ where $v$ is sparse, (3) measure the mean squared error of the rank-1 estimate of the missing values created in step 1.
Then, the selected tuning parameter value is that which resulted in the lowest average mean squared
error in step 3.
In order to perform cross-validation for the second sparse principal component, apply this function
to $X-udv’$ where $udv’$ are the output of running SPC on the raw data $X$.
Value
cv Average sum of squared errors that results for each tuning parameter value.
cv.error Standard error of the average sum of squared error that results for each tuning
parameter value.
bestsumabsv Value of sumabsv that resulted in lowest CV error.
nonzerovs Average number of non-zero elements of v for each candidate value of sumabsvs.
v.init Initial value of v that was passed in. Or, if that was NULL, then first right
singular vector of X.
bestsumabsv1se The smallest value of sumabsv that is within 1 standard error of smallest CV
error.
Author(s)
Daniela M. Witten and Robert Tibshirani
References
Witten DM, Tibshirani R and Hastie T (2008) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Submitted. <http://www-stat.stanford.edu/~dwitten>
See Also
SPC, PMD, PMD.cv
Examples
# A simple simulated example
set.seed(1)
u <- matrix(c(rnorm(50), rep(0,150)),ncol=1)
v <- matrix(c(rnorm(75),rep(0,225)), ncol=1)
x <- u%*%t(v)+matrix(rnorm(200*300),ncol=300)
# Perform sparse PCA - that is, decompose a matrix w/o penalty on rows
# and w/ L1 penalty on columns.
# Use cross-validation to choose the sparsity level, then fit SPC:
cv.out <- SPC.cv(x, sumabsvs=seq(1.2, 6, len=6))
print(cv.out)
plot(cv.out)
out <- SPC(x, sumabsv=cv.out$bestsumabsv, K=3)
print(out)
Index
*Topic datasets: breastdata
*Topic package: PMA-package
breastdata
MultiCCA
MultiCCA.permute
plot.CCA.permute (CCA.permute)
plot.MultiCCA.permute (MultiCCA.permute)
plot.SPC.cv (SPC.cv)
PlotCGH
PMA (PMA-package)
PMA-package
PMD
PMD.cv
print.CCA (CCA)
print.CCA.permute (CCA.permute)
print.MultiCCA (MultiCCA)
print.MultiCCA.permute (MultiCCA.permute)
print.SPC (SPC)
print.SPC.cv (SPC.cv)