What is the fused lasso?

The fused lasso, introduced in Tibshirani et al. (2005) (Reference 1), is an extension of the lasso. While the lasso produces sparse models, the fused lasso produces models that are not only sparse, but favor a locally constant coefficient profile.

Assume that we have observations i = 1, \dots, n. For each observation i, assume that we have feature data \boldsymbol{x}_i = ( x_{i1}, x_{i2}, \dots, x_{ip}) and a response y_i which we want to predict. Both the lasso and fused lasso model the predictions as

\begin{aligned} f(\boldsymbol{x}_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}. \end{aligned}

Where the methods differ is how they determine the coefficients \beta_0, \beta_1, \dots, \beta_p. The lasso estimates the coefficients as the solution to the minimization problem

\begin{aligned} \hat\beta = \underset{\beta}{\text{argmin}} \; \left[ \frac{1}{2n} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j | \right], \end{aligned}

where \lambda \geq 0 is a hyperparameter. (Note that the intercept \beta_0 is not included in the penalty term: this is usually the case for regularized models.) The penalty term encourages the coefficients \beta_j to shrink to zero, giving sparsity.
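As an aside, the plain lasso can be fit in R with the glmnet package. Here is a minimal sketch (assuming glmnet is installed; the data here is simulated purely for illustration):

```r
# Lasso with glmnet: alpha = 1 (the default) gives the lasso penalty,
# and the model is fit for a whole path of lambda values at once.
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

fit <- glmnet(x, y, alpha = 1)
coef(fit, s = 0.1)  # coefficients at lambda = 0.1; most are exactly zero
```

Note that coef() returns a sparse (p + 1)-vector: the intercept plus one entry per feature, most of which are exactly zero at this lambda.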

The fused lasso estimates the coefficients as the solution to the minimization problem

\begin{aligned} \hat\beta = \underset{\beta}{\text{argmin}} \; \left[ \frac{1}{2n} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda_1 \sum_{j=1}^p |\beta_j | + \lambda_2 \sum_{j=2}^p |\beta_j - \beta_{j-1} | \right], \end{aligned}

where \lambda_1 \geq 0 and \lambda_2 \geq 0 are hyperparameters. On top of penalizing the size of the coefficients, the fused lasso also penalizes the differences between successive coefficients. This encourages the coefficient profile (the plot of \hat\beta_j vs. j) to look piecewise-constant.

The minimization problem for the fused lasso is a quadratic program and can be solved with general-purpose QP solvers. There are, of course, techniques to make the solver more efficient: the original paper (Reference 1) has some ideas. You can fit the fused lasso in R with the fusedlasso function in the genlasso package.
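Here is a sketch of a fit (assuming the genlasso package is installed), using its fusedlasso1d() convenience wrapper for features ordered along a line. Note that genlasso parametrizes the penalty as \lambda \left( \sum_j |\beta_j - \beta_{j-1}| + \gamma \sum_j |\beta_j| \right), so its gamma argument plays the role of the ratio \lambda_1 / \lambda_2 in the problem above:

```r
# Sketch: sparse fused lasso along ordered features with genlasso.
library(genlasso)

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(rep(0, 5), rep(2, 5), rep(0, 5), rep(-1, 5))  # piecewise-constant
y <- as.numeric(X %*% beta + rnorm(n))
y <- y - mean(y)  # genlasso does not fit an intercept, so center y

fit <- fusedlasso1d(y, X = X, gamma = 0.5)
lam <- median(fit$lambda)  # pick a lambda value on the computed path
betahat <- as.numeric(coef(fit, lambda = lam)$beta)
plot(betahat, type = "s", xlab = "j", ylab = "coefficient")
```

With enough regularization, the plotted coefficient profile should look roughly piecewise-constant, recovering the blocks in the true beta.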

Why might you want to use the fused lasso? You might want to do so when the features are ordered in some meaningful way. Note that the lasso solution is invariant to the ordering of the features whereas the fused lasso solution is not. In Reference 1, they gave a motivating example from protein mass spectroscopy. For a particular sample i, they have intensity values x_{ij} for several mass over charge ratios (m/z). The goal is to find m/z-sites that discriminate between healthy and sick patients. The feature number j is related to the m/z value, so it might make sense to have a coefficient profile that is locally constant.

References:

  1. Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1), 91-108.

What is a dgCMatrix object made of? (sparse matrix format in R)

I’ve been working with sparse matrices in R recently (those created using Matrix::Matrix with the option sparse=TRUE) and found it difficult to track down documentation about what the slots in the matrix object are. This post describes the slots of a dgCMatrix object.

(Click here for the full documentation of the Matrix package. And it is a lot: like, 215 pages a lot.)

Background

It turns out that there is some documentation on dgCMatrix objects within the Matrix package. One can access it using the following code:

library(Matrix)
?`dgCMatrix-class`

According to the documentation, the dgCMatrix class

…is a class of sparse numeric matrices in the compressed, sparse, column-oriented format. In this implementation the non-zero elements in the columns are sorted into increasing row order. dgCMatrix is the “standard” class for sparse numeric matrices in the Matrix package.

An example

We’ll use a small matrix as a running example in this post:

library(Matrix)
M <- Matrix(c(0, 0,  0, 2,
              6, 0, -1, 5,
              0, 4,  3, 0,
              0, 0,  5, 0),
            byrow = TRUE, nrow = 4, sparse = TRUE)
rownames(M) <- paste0("r", 1:4)
colnames(M) <- paste0("c", 1:4)
M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

Running str on M tells us that the dgCMatrix object has 6 slots. (To learn more about slots and S4 objects, see this section from Hadley Wickham’s Advanced R.)

str(M)
# Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
# ..@ i       : int [1:7] 1 2 1 2 3 0 1
# ..@ p       : int [1:5] 0 1 2 5 7
# ..@ Dim     : int [1:2] 4 4
# ..@ Dimnames:List of 2
# .. ..$ : chr [1:4] "r1" "r2" "r3" "r4"
# .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
# ..@ x       : num [1:7] 6 4 -1 3 5 2 5
# ..@ factors : list()

x, i and p

If a matrix M has n non-zero entries, then its x slot is a vector of length n containing all the non-zero values in the matrix. The non-zero elements in column 1 are listed first (starting from the top and ending at the bottom), followed by those in columns 2, 3 and so on.

M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

M@x
# [1]  6  4 -1  3  5  2  5

as.numeric(M)[as.numeric(M) != 0]
# [1]  6  4 -1  3  5  2  5

The i slot is a vector of length n. The kth element of M@i is the row index of the kth non-zero element (as listed in M@x). One big thing to note here is that the first row has index ZERO, unlike R’s usual indexing convention. In our example, the first non-zero entry, 6, is in the second row, i.e. row index 1, so the first entry of M@i is 1.

M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

M@i
# [1] 1 2 1 2 3 0 1

If the matrix has nvars columns, then the p slot is a vector of length nvars + 1. If we index the columns such that the first column has index ZERO, then M@p[1] = 0 always, and M@p[j+2] - M@p[j+1] gives us the number of non-zero elements in column index j. (Note that M@p itself is accessed with R’s usual 1-based indexing.)

In our example, when j = 2, M@p[2+2] - M@p[2+1] = 5 - 2 = 3, so there are 3 non-zero elements in column index 2 (i.e. the third column).

M
# 4 x 4 sparse Matrix of class "dgCMatrix"
#    c1 c2 c3 c4
# r1  .  .  .  2
# r2  6  . -1  5
# r3  .  4  3  .
# r4  .  .  5  .

M@p
# [1] 0 1 2 5 7
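A handy consequence of this cumulative structure: applying diff to the p slot gives the number of non-zero entries in each column directly. (The block below rebuilds M so it runs on its own.)

```r
library(Matrix)
M <- Matrix(c(0, 0,  0, 2,
              6, 0, -1, 5,
              0, 4,  3, 0,
              0, 0,  5, 0),
            byrow = TRUE, nrow = 4, sparse = TRUE)

# successive differences of p = counts of non-zeros per column
diff(M@p)
# [1] 1 1 3 2
```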

With the x, i and p slots, one can reconstruct the entries of the matrix.
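To make this concrete, here is a small base-R sketch that rebuilds the dense matrix from the three slots alone, converting the zero-based i and p values to R’s 1-based indexing as it goes (M is reconstructed so the block runs on its own):

```r
library(Matrix)
M <- Matrix(c(0, 0,  0, 2,
              6, 0, -1, 5,
              0, 4,  3, 0,
              0, 0,  5, 0),
            byrow = TRUE, nrow = 4, sparse = TRUE)

dense <- matrix(0, nrow = M@Dim[1], ncol = M@Dim[2])
for (col in seq_len(M@Dim[2])) {
  # entries of column `col` occupy positions (p[col] + 1):p[col + 1] of x and i
  idx <- seq.int(M@p[col] + 1, length.out = M@p[col + 1] - M@p[col])
  dense[M@i[idx] + 1, col] <- M@x[idx]  # + 1 because i is zero-indexed
}
all.equal(dense, unname(as.matrix(M)))
# [1] TRUE
```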

Dim and Dimnames

These two slots are fairly obvious. Dim is a vector of length 2, with the first and second entries denoting the number of rows and columns the matrix has respectively. Dimnames is a list of length 2: the first element being a vector of row names (if present) and the second being a vector of column names (if present).
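For what it’s worth, the usual dim() and dimnames() accessors appear to simply read off these slots (a quick check, rebuilding the running example):

```r
library(Matrix)
M <- Matrix(c(0, 0,  0, 2,
              6, 0, -1, 5,
              0, 4,  3, 0,
              0, 0,  5, 0),
            byrow = TRUE, nrow = 4, sparse = TRUE)
rownames(M) <- paste0("r", 1:4)
colnames(M) <- paste0("c", 1:4)

identical(dim(M), M@Dim)            # compare accessor with the slot
identical(dimnames(M), M@Dimnames)
```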

factors

This slot is probably the most unusual of the lot, and its documentation was a bit difficult to track down. From the CRAN documentation, it looks like factors is

… [an] Object of class “list” – a list of factorizations of the matrix. Note that this is typically empty, i.e., list(), initially and is updated automagically whenever a matrix factorization is computed.

My understanding is if we perform any matrix factorizations or decompositions on a dgCMatrix object, it stores the factorization under factors so that if asked for the factorization again, it can return the cached value instead of recomputing the factorization. Here is an example:

M@factors
# list()

Mlu <- lu(M)  # perform triangular decomposition
str(M@factors)
# List of 1
# $ LU:Formal class 'sparseLU' [package "Matrix"] with 5 slots
# .. ..@ L  :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
# .. .. .. ..@ i       : int [1:4] 0 1 2 3
# .. .. .. ..@ p       : int [1:5] 0 1 2 3 4
# .. .. .. ..@ Dim     : int [1:2] 4 4
# .. .. .. ..@ Dimnames:List of 2
# .. .. .. .. ..$ : chr [1:4] "r2" "r3" "r4" "r1"
# .. .. .. .. ..$ : NULL
# .. .. .. ..@ x       : num [1:4] 1 1 1 1
# .. .. .. ..@ uplo    : chr "U"
# .. .. .. ..@ diag    : chr "N"
# .. ..@ U  :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
# .. .. .. ..@ i       : int [1:7] 0 1 0 1 2 0 3
# .. .. .. ..@ p       : int [1:5] 0 1 2 5 7
# .. .. .. ..@ Dim     : int [1:2] 4 4
# .. .. .. ..@ Dimnames:List of 2
# .. .. .. .. ..$ : NULL
# .. .. .. .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
# .. .. .. ..@ x       : num [1:7] 6 4 -1 3 5 5 2
# .. .. .. ..@ uplo    : chr "U"
# .. .. .. ..@ diag    : chr "N"
# .. ..@ p  : int [1:4] 1 2 3 0
# .. ..@ q  : int [1:4] 0 1 2 3
# .. ..@ Dim: int [1:2] 4 4

Here is an example which shows that the decomposition is only performed once:

set.seed(1)
M <- runif(9e6)
M[sample.int(9e6, size = 8e6)] <- 0
M <- Matrix(M, nrow = 3e3, sparse = TRUE)

system.time(lu(M))
#   user  system elapsed
# 13.527   0.161  13.701

system.time(lu(M))
#   user  system elapsed
#      0       0       0

An unofficial vignette for the gamsel package

I’ve been working on a project/package that closely mirrors that of GAMSEL (generalized additive model selection), a method for fitting sparse generalized additive models (GAMs). In preparing my package, I realized that (i) the gamsel package which implements GAMSEL doesn’t have a vignette, and (ii) I could modify the vignette for my package minimally to create one for gamsel. So here it is!

For a markdown version of the vignette, go here. Unfortunately LaTeX doesn’t render well on GitHub… The R Markdown file is available here: you can download it and knit it on your own machine.

Introduction

This is an unofficial vignette for the gamsel package. GAMSEL (Generalized Additive Model Selection) is a method for fitting sparse generalized additive models proposed by Alexandra Chouldechova and Trevor Hastie. Here is the abstract of the paper:

We introduce GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension. Our method interpolates between null, linear and additive models by allowing the effect of each variable to be estimated as being either zero, linear, or a low-complexity curve, as determined by the data. We present a blockwise coordinate descent procedure for efficiently optimizing the penalized likelihood objective over a dense grid of the tuning parameter, producing a regularization path of additive models. We demonstrate the performance of our method on both real and simulated data examples, and compare it with existing techniques for additive model selection.

Let there be n observations, each consisting of p features. Let \mathbf{X} \in \mathbb{R}^{n \times p} denote the overall feature matrix, and let y \in \mathbb{R}^n denote the vector of responses. Let X_j \in \mathbb{R}^n denote the jth column of \mathbf{X}.

For each variable X_j, let U_j \in \mathbb{R}^{n \times m_j} be the matrix of evaluations of m_j basis functions u_1, \dots, u_{m_j} at points x_{1j}, \dots, x_{nj}. The model response from GAMSEL is of the form

\begin{aligned} \hat{y} = \alpha_0 + \sum_{j=1}^p \left( \alpha_j X_j + U_j \beta_j \right), \end{aligned}

where \beta_j \in \mathbb{R}^{m_j}.

For more details on the method, see the arXiv paper. For `gamsel`’s official R documentation, see this link.

The gamsel() function

The purpose of this section is to give users a general sense of the gamsel() function, which is probably the most important function of the package. First, we load the gamsel package:

library(gamsel)
#> Loading required package: foreach
#> Loading required package: mda
#> Loading required package: class
#> Loaded mda 0.4-10
#> Loaded gamsel 1.8-1

Let’s generate some data:

set.seed(1)
n <- 100; p <- 12
x <- matrix(rnorm(n * p), ncol = p)
f4 <- 2 * x[, 4]^2 + 4 * x[, 4] - 2
f5 <- -2 * x[, 5]^2 + 2
f6 <- 0.5 * x[, 6]^3
mu <- rowSums(x[, 1:3]) + f4 + f5 + f6
y <- mu + sqrt(var(mu) / 4) * rnorm(n)

We fit the model using the most basic call to gamsel():

fit <- gamsel(x, y)

The function gamsel() fits GAMSEL for a path of lambda values and returns a gamsel object. Typical usage is to have gamsel() specify the lambda sequence on its own. (By default, the model is fit for 50 different lambda values.) The returned gamsel object contains some useful information on the fitted model. For a given value of the \lambda hyperparameter, GAMSEL gives the predictions of the form

\begin{aligned} \hat{y} = \alpha_0 + \sum_{j=1}^p \left( \alpha_j X_j + U_j \beta_j \right), \end{aligned}

where \alpha_0 \in \mathbb{R}, \alpha_j \in \mathbb{R} and \beta_j \in \mathbb{R}^{m_j} are fitted coefficients.

Printing the returned gamsel object tells us how many features, linear components and non-linear components were included in the model for each lambda value respectively. It also shows the fraction of null deviance explained by the model and the lambda value for that model.

fit
#> 
#> Call:  gamsel(x = x, y = y) 
#> 
#>       NonZero Lin NonLin    %Dev  Lambda
#>  [1,]       0   0      0 0.00000 80.1300
#>  [2,]       1   1      0 0.03693 72.9400
#>  [3,]       1   1      0 0.06754 66.4000
#>  [4,]       1   1      0 0.09290 60.4400
#>  [5,]       1   1      0 0.11390 55.0200
#>  [6,]       1   1      0 0.13130 50.0900
#>  [7,]       1   1      0 0.14580 45.5900
#>  .... (redacted for conciseness)

gamsel() has a tuning parameter gamma which lies between 0 and 1. Smaller values of gamma penalize the linear components less than the non-linear components, resulting in more linear components in the fitted model. The default value is gamma = 0.4.

fit2 <- gamsel(x, y, gamma = 0.8)

By default, each variable is given m_j = 10 basis functions. This can be modified with the degrees option, and this value can differ from variable to variable (to allow for this, pass a vector of length equal to the number of variables to the degrees option).

By default, the maximum degrees of freedom for each variable is 5. This can be modified with the dfs option, with larger values allowing more “wiggly” fits. Again, this value can differ from variable to variable.
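As an illustrative sketch (the simulated data and the specific degrees/dfs numbers here are arbitrary), both options accept a vector with one entry per variable:

```r
library(gamsel)

set.seed(1)
n <- 100; p <- 12
x <- matrix(rnorm(n * p), ncol = p)
y <- rowSums(x[, 1:3]) + rnorm(n)

# 12 basis functions and up to 6 degrees of freedom for the first six
# variables, 8 basis functions and up to 4 df for the rest
# (each df value should not exceed the corresponding degree)
fit3 <- gamsel(x, y,
               degrees = rep(c(12, 8), each = 6),
               dfs     = rep(c(6, 4), each = 6))
```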

Predictions

Predictions from this model can be obtained by using the predict method on the fitted gamsel object: each column of the returned matrix gives the predictions for one value of lambda.

# get predictions for all values of lambda
all_predictions <- predict(fit, x)
dim(all_predictions)
#> [1] 100  50

# get predictions for 20th model for first 5 observations
all_predictions[1:5, 20]
#> [1]  1.88361056 -4.47189543  8.05935149 -0.04271584  5.93270321

One can also specify the lambda indices for which predictions are desired:

# get predictions for 20th model for first 5 observations
predict(fit, x[1:5, ], index = 20)
#>              l20
#> [1,]  1.88361056
#> [2,] -4.47189543
#> [3,]  8.05935149
#> [4,] -0.04271584
#> [5,]  5.93270321

The predict method has a type option which allows it to return different types of information to the user. type = "terms" gives the fitted function values for each variable, which can be useful for understanding the effect that each variable has on the response. Note that what is returned is a 3-dimensional array, with dimensions observations x lambda indices x variables!

# effect of variables for first 10 observations and 20th model
indiv_fits <- predict(fit, x[1:10, ], index = c(20), type = "terms")
dim(indiv_fits)
#> [1] 10  1 12
# effect of variable 4 on these first 10 observations
plot(x[1:10, 4], indiv_fits[, , 4])

type = "nonzero" returns a list of indices of non-zero coefficients at a given lambda.

# variables selected by GAMSEL at the 10th and 50th lambda values
predict(fit, x, index = c(10, 50), type = "nonzero")
#> $l10
#> [1] 2 3 4 5 6
#> 
#> $l50
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12

Plots and summaries

Let’s fit the basic gamsel model again:

fit <- gamsel(x, y)

fit is a class “gamsel” object which comes with a plot method. The plot method shows us the relationship our predicted response has with each input feature, i.e. it plots \alpha_j X_j + U_j \beta_j vs. X_j for each j. Besides passing fit to the plot() call, the user must also pass an input matrix x: this is used to determine the coordinate limits for the plot. It is recommended that the user simply pass in the same input matrix that the GAMSEL model was fit on.

By default, plot() gives the fitted functions for the last value of the lambda sequence in fit, and gives plots for all the features. For high-dimensional data, this latter default is problematic as it will produce too many plots! You can pass a vector of indices to the which option to specify which features you want plots for. The code below gives plots for the first 4 features:

par(mfrow = c(1, 4))
par(mar = c(4, 2, 2, 2))
plot(fit, x, which = 1:4)

The user can specify the index of the lambda value to show using the index option:

# show fitted functions for x2, x5 and x8 at the model for the 15th lambda value
par(mfrow = c(1, 3))
plot(fit, x, index = 15, which = c(2, 5, 8))

Linear functions are colored green, non-linear functions are colored red, while zero functions are colored blue.

Class “gamsel” objects also have a summary method which allows the user to see the coefficient profiles of the linear and non-linear features. On each plot (one for linear features and one for non-linear features), the x-axis is the \lambda value going from large to small. For linear components, the y-axis is the coefficient for each variable while for the nonlinear components, it is the norm of the nonlinear coefficients.

par(mfrow = c(1, 2))
summary(fit)

We can include annotations to show which profile belongs to which feature by specifying label = TRUE.

par(mfrow = c(1, 2))
summary(fit, label = TRUE)

Cross-validation

We can perform k-fold cross-validation (CV) for GAMSEL with cv.gamsel(). It does 10-fold cross-validation by default:

cvfit <- cv.gamsel(x, y)

We can change the number of folds using the nfolds option:

cvfit <- cv.gamsel(x, y, nfolds = 5)

If we want to specify which observation belongs to which fold, we can do that by specifying the foldid option, which is a vector of length n, with the ith element being the fold number for observation i.

set.seed(3)
foldid <- sample(rep(seq(10), length = n))
cvfit <- cv.gamsel(x, y, foldid = foldid)

A cv.gamsel() call returns a cv.gamsel object. We can plot this object to get the CV curve with error bars (one standard error in each direction). The left vertical dotted line represents lambda.min, the lambda value which attains minimum CV error, while the right vertical dotted line represents lambda.1se, the largest lambda value with CV error within one standard error of the minimum CV error.

plot(cvfit)

The numbers at the top represent the number of features in our original input matrix that are included in the model.

The two special lambda values, as well as their indices in the lambda path, can be extracted directly from the cv.gamsel object:

# lambda values
cvfit$lambda.min
#> [1] 3.959832
cvfit$lambda.1se
#> [1] 8.39861
# corresponding lambda indices
cvfit$index.min
#> [1] 33
cvfit$index.1se
#> [1] 25

Logistic regression with binary data

In the examples above, y was a quantitative variable (i.e. takes values along the real number line). As such, using the default family = "gaussian" for gamsel() was appropriate. In theory, the GAMSEL algorithm is very flexible and can be used when y is not a quantitative variable. In practice, gamsel() has been implemented for binary response data. The user can use gamsel() (or cv.gamsel()) to fit a model for binary data by specifying family = "binomial". All the other functions we talked about above can be used in the same way.

In this setting, the response y should be a numeric vector containing just 0s and 1s. When doing prediction, note that by default predict() gives just the value of the linear predictors, i.e.

\begin{aligned} \log [\hat{p} / (1 - \hat{p})] = \hat{y} = \alpha_0 + \sum_{j=1}^p \left( \alpha_j X_j + U_j \beta_j \right), \end{aligned}

where \hat{p} is the predicted probability. To get the predicted probability, the user has to pass type = "response" to the predict() call.

# fit binary model
bin_y <- ifelse(y > 0, 1, 0)
binfit <- gamsel(x, bin_y, family = "binomial")

# linear predictors for first 5 observations at 10th model
predict(binfit, x[1:5, ])[, 10]
#> [1]  0.1293867 -0.4531994  0.4528493 -0.2539989  0.3576436

# predicted probabilities for first 5 observations at 10th model
predict(binfit, x[1:5, ], type = "response")[, 10]
#> [1] 0.5323016 0.3886003 0.6113165 0.4368395 0.5884699