A deep dive into glmnet: predict.glmnet

I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, instead of looking at one of the function options of glmnet, we’ll look at the predict method for a glmnet object instead. The object returned by glmnet (call it fit) has class "glmnet"; when we run predict(fit), it runs the predict method for class "glmnet" objects, i.e. predict.glmnet(fit).

For reference, here is the full signature of the predict.glmnet function/method (v3.0-2):

predict(object, newx, s = NULL, type = c("link",
  "response", "coefficients", "nonzero", "class"), exact = FALSE,
  newoffset, ...)

In the above, object is a fitted "glmnet" object (call it fit). Recall that every glmnet fit has a lambda sequence associated with it: this will be important in understanding what follows. (This sequence can be accessed via fit$lambda.)

For the rest of this post, we will use the following data example:

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), nrow = n)
beta <- matrix(c(rep(1, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)

fit <- glmnet(x, y)

Function option: newx

newx is simply the new x matrix at which we want predictions for. So for example, if we want predictions for the training x matrix, we would do

predict(fit, x)

If no other arguments are passed, we will get a matrix of predictions, each column corresponding to predictions for each value of \lambda in fit$lambda. For our example, fit$lambda has length 68 and x consists of 100 rows/observations, so predict(fit, x) returns a 100 \times 68 matrix.

length(fit$lambda)
# [1] 68
dim(predict(fit, x))
# [1] 100  68

newx must be provided except when type="coefficients" or type="nonzero" (more on these types later).

Function option: newoffset

If the original glmnet call was fit with an offset, then an offset must be included in the predict call under the newoffset option. If not included, an error will be thrown.

set.seed(2)
offset <- rnorm(n)
fit2 <- glmnet(x, y, offset = offset)
predict(fit2, x)
# Error: No newoffset provided for prediction, yet offset used in fit of glmnet

The reverse is true, in that if the original glmnet call was NOT fit with an offset, then predict will not allow you to include an offset in the prediction, EVEN if you pass it the newoffset option. It does not throw a warning or error, but simply ignore the newoffset option. You have been warned! This is demonstrated in the code snippet below.

pred_no_offset <- predict(fit, x)
pred_w_offset <- predict(fit, x, offset = offset)
max(abs(pred_no_offset - pred_w_offset))
# [1] 0

Function option: s and exact

s indicates the \lambda values for which we want predictions at. If the user does not specify s, predict will give predictions for each of the \lambda values in fit$lambda.

(Why is this option named s and not the more intuitive lambda? In page 5 of this vignette, the authors say they made this choice “in case later we want to allow one to specify the model size in other ways”. lambda controls the model size in the sense that the larger it is, the more coefficients will be forced to zero. There are other ways to specify model size. For example, one could imagine a function option where we specify the number of non-zero coefficients we want in the model, or where we specify the maximum \ell_1 norm the coefficient vector can have. None of these other options have been implemented at the moment.)

If the user-specified s values all belong to fit$lambda, then predict pulls out the coefficients corresponding to those values and returns predictions. In this case, the exact option has no effect.

If the user-specified s value does NOT belong to fit$lambda, things get interesting. If exact=FALSE (the default), predict uses linear interpolation to make predictions. (More accurately, it does linear interpolation of the coefficients, which translates to linear interpolation of the predictions.) As stated in the documentation: “while this is often a good approximation, it can sometimes be a bit coarse”.

As a demonstration: In the snippet below, we look at the predictions at a value of \lambda that lies between the two largest values in fit$lambda. If the function does as the documentation says, the last line should give a value of 0 (to machine precision).

b1 <- as.numeric(predict(fit, x, s = fit$lambda[1]))
b2 <- as.numeric(predict(fit, x, s = fit$lambda[2]))
b3 <- as.numeric(predict(fit, x,
    s = 0.3*fit$lambda[1] + 0.7*fit$lambda[2]))
max(abs(b3 - (0.3*b1 + 0.7*b2)))
# [1] 3.885781e-16

What happens if we have values in s that are not within the range of fit$lambda? First, I would recommend using exact=TRUE because extrapolation beyond the range of fit$lambda is dangerous in general. In my little experiments, it looks like predict simply returns the predictions for the \lambda value in fit$lambda that is closest to s.

If exact=TRUE, predict merges s with fit$lambda to get a single (decreasing) \lambda sequence, refits the glmnet model, then returns predictions at the \lambda values in s. If your training data is very large, this refitting could take a long time.

One note when using exact=TRUE is that you have to pass in additional arguments in order for the refitting to happen. That’s because the fitted glmnet object does not contain all the ingredients needed to do refitting. For our example, to predict for fit we need to supply x and y as well. For more complicated glmnet calls, more options have to be provided.

predict(fit, x, s = fit$lambda[68] / 2, exact = TRUE)
# Error: used coef.glmnet() or predict.glmnet() with `exact=TRUE` 
# so must in addition supply original argument(s)  x and y  in order to 
# safely rerun glmnet
predict(fit, x, s = fit$lambda[68] / 2, exact = TRUE, x = x, y = y)
# glmnet correctly returns predictions...

Function option: type

The type option determines the type of prediction returned. type="coefficients" returns the model coefficients for the \lambda values in s as a sparse matrix. type="nonzero" returns a list, with each element being a vector of the features which have non-zero features. For example, the code snippet below shows that for the second and third \lambda values in fit$lambda, the features that have non-zero coefficients are feature 5 and features 3 and 5 respectively.

predict(fit, type = "nonzero", s = fit$lambda[2:3])
# $`1`
# [1] 5
# 
# $`2`
# [1] 3 5

For type="coefficients" and type="nonzero", the user does not have to provide a newx argument since the return value does not depend on where we want the predictions. For the rest of the possible values of type, newx is required.

For type="link" (the default) and type="response" it helps to know a little GLM theory. For a observation having values x \in \mathbb{R}^p, type="link" returns x^T \beta, where \beta is the coefficient vector corresponding to a \lambda value in s.

For type="response", x^T \beta is passed through the GLM’s inverse link function to return predictions on the y scale. For “gaussian” family it is still x^T \beta. For “binomial” and “poisson” families it is \exp(x^T \beta) / (1 + \exp(x^T \beta)) and \exp(x^T \beta) respectively. For “multinomial” it returns fitted probabilities and for “cox” it returns fitted relative risk.

The final possibility, type="class", applies only to “binomial” and “multinomial” families. For each observation, it simply returns the class with the highest predicted probability.

Bonus: The coef method

The coef method for glmnet is actually just a special case of the predict method. This can be seen from the source code:

coef.glmnet
# function (object, s = NULL, exact = FALSE, ...) 
#     predict(object, s = s, type = "coefficients", exact = exact, 
#             ...)
# <bytecode: 0x7ff3ae934f20>
# <environment: namespace:glmnet>

Bonus: predict.elnet, predict.lognet, …

If you inspect the class of the object returned by a glmnet call, you will realize that it has more than one class. In the code below, we see that “gaussian” family results in an “elnet” class object. (“binomial” family returns a “lognet” object, “poisson” family returns a “fishnet” object, etc.)

class(fit)
# [1] "elnet"  "glmnet"

These classes have their own predict methods as well, but they draw on this base predict.glmnet call. As an example, here is the code for predict.fishnet:

glmnet:::predict.fishnet
# function (object, newx, s = NULL, type = c("link", "response", 
#        "coefficients", "nonzero"), exact = FALSE, newoffset, ...) 
# {
#     type = match.arg(type)
#     nfit = NextMethod("predict")
#     switch(type, response = exp(nfit), nfit)
# }
# <bytecode: 0x7ff3ab622040>
# <environment: namespace:glmnet>

What happens here is that predict.glmnet is first called. If type is not "response", then we simply return whatever predict.glmnet would have returned. However, if type="response", then (i) we call predict.glmnet, and (ii) the predictions are passed through the function x \mapsto \exp(x) before being returned.

This is how predict is able to give the correct return output across the different family and type options.

A deep dive into glmnet: type.gaussian

I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, we will look at the type.gaussian option.

For reference, here is the full signature of the glmnet function (v3.0-2):

glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial",
  "cox", "mgaussian"), weights, offset = NULL, alpha = 1,
  nlambda = 100, lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04),
  lambda = NULL, standardize = TRUE, intercept = TRUE,
  thresh = 1e-07, dfmax = nvars + 1, pmax = min(dfmax * 2 + 20,
  nvars), exclude, penalty.factor = rep(1, nvars), lower.limits = -Inf,
  upper.limits = Inf, maxit = 1e+05, type.gaussian = ifelse(nvars <
  500, "covariance", "naive"), type.logistic = c("Newton",
  "modified.Newton"), standardize.response = FALSE,
  type.multinomial = c("ungrouped", "grouped"), relax = FALSE,
  trace.it = 0, ...)

type.gaussian

According to the official R documentation,

Two algorithm types are supported for (only) family="gaussian". The default when nvar<500 is type.gaussian="covariance", and saves all inner-products ever computed. This can be much faster than type.gaussian="naive", which loops through nobs every time an inner-product is computed. The latter can be far more efficient for nvar >> nobs situations, or when nvar > 500.

Generally speaking there is no need for you as the user to change this option.

How do the fitting times compare?

I ran a timing simulation to compare the function run times of type.gaussian="naive" vs. type.gaussian="covariance" for a range of number of observations (nobs or n) and number of features (nvar or p). The results are shown below. (For the R code that generated these plots, see here.)

This first panel of boxplots shows the time taken for type.gaussian="naive" to complete as a fraction (or multiple) of that for type.gaussian="covariance" (each boxplot represents 5 simulation runs). As advertised, naive runs more slowly for small values of p but more quickly for large values of p. The difference seems to be more stark when n is larger.

This next plot shows the absolute fitting times: note the log scale on both the x and y axes.

So, what algorithms do these two options represent? What follows is based on Statistical Learning with Sparsity by Hastie, Tibshirani and Wainwright (free PDF here, p124 of the PDF, p113 of the book).

Let y_i \in \mathbb{R} denote the response for observation i, and let x_{ij} \in \mathbb{R} denote the value of feature j for observation i. Assume that the response and the features are standardized to have mean zero so that we don’t have to worry about fitting an intercept term. For each value of \lambda in lambda, glmnet is minimizing the following objective function:

\begin{aligned} \underset{\beta}{\text{minimize}} \quad J(\beta) = \frac{1}{2n}\displaystyle\sum_{i=1}^n \left(y_i - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \displaystyle\sum_{j=1}^p \left(\frac{1 - \alpha}{2}\beta_j^2 + \alpha |\beta_j | \right). \end{aligned}

We minimize the expression above by cyclic coordinate descent. That is, we cycle through the features j = 1, \dots, p. For each j, treat J as a function of \beta_j only (pretend that \beta_k for k \neq j are fixed), then update \beta_j to the value which minimizes J. It turns out that this update is very simple:

\begin{aligned} \beta_j \leftarrow \dfrac{\mathcal{S}_{\alpha \lambda} \left( \frac{1}{n}\sum_{i=1}^n r_i^{(j)}x_{ij} \right)}{\frac{1}{n}\sum_{i=1}^n x_{ij}^2 + (1-\alpha)\lambda}, \end{aligned}

where \mathcal{S} is the soft-thresholding operator and r_i^{(j)} = y_i - \sum_{k \neq j} \beta_k x_{ik} is the partial residual.

Both of the modes minimize J in this way. Where they differ is in how they keep track of the quantities needed to do the update above. From here, assume that the data has been standardized. (What follows works for unstandardized data as well but just has more complicated expressions.)

type.gaussian = "naive"

As the features are standardized, we can write the argument in the soft-thresholding operator as

\begin{aligned} \frac{1}{n}\sum_{i=1}^n r_i^{(j)}x_{ij} = \frac{1}{n}\sum_{i=1}^n r_i x_{ij} + \beta_j, \qquad (1)\end{aligned}

where r_i is the full residual for observation i. In this mode, we keep track of the full residuals r_i, i = 1, \dots, n.

  • At a coordinate descent step for feature j, if the coefficient \beta_j doesn’t change its value, no updating of r_i is needed. However, to get the LHS of (1) for the next feature (j+1), we need to make O(n) operations to compute the sum on the RHS of (1).
  • If \beta_j changes value, then we have to update the r_i‘s, then recompute the LHS of (1) for the next feature using the expression on the RHS. This also takes O(n) time.

All in all, a full cycle through all p variables costs O(np) operations.

type.gaussian = "covariance"

Ignoring the factor of \frac{1}{n}, note that the first term on the RHS of (1) can be written as

\begin{aligned} \sum_{i=1}^n r_i x_{ij}  = \langle x_j, y \rangle - \sum_{k : |\beta_k| > 0} \langle x_j, x_k \rangle \beta_k. \end{aligned}

In this mode, we compute all the inner products \langle x_j, y \rangle (p of them) which takes O(np) operations. For each k such that \beta_k \neq 0, we store the current values of \langle x_j, x_k \rangle \beta_k (there are p of them for each k).

  • At a coordinate descent step for feature k, if \beta_k \neq 0 and the beginning of the step and its value changes, we need to update the  \langle x_j, x_k \rangle \beta_k terms with O(p) operations. Then, to calculate \sum_{i=1}^n r_i x_{ij} for the next coordinate descent step, we only need O(q) operations, where q is the number of non-zero coefficients at the moment.
  • As such, if no new variables become non-zero in a full cycle through the features, one full cycle takes only O(pq) operations.
  • If a new feature x_k enters the model for the first time (i.e. \beta_k becomes non-zero), then we need to compute and store \langle x_j, x_k \rangle \beta_k for j =1, \dots, p, which takes O(np).

This form of updating avoids the O(n) updating needed at every step at each feature in naive mode. While we sometimes occur O(np) operations when a new variable enters the model, such events don’t happen often. Also, we have q \leq p, so if p is small or if n \gg p, the O(pq) operations for one full cycle pale in comparison with O(np) operations for naive updating.

A deep dive into glmnet: offset

I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, we will look at the offset option.

For reference, here is the full signature of the glmnet function (v3.0-2):

glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial",
  "cox", "mgaussian"), weights, offset = NULL, alpha = 1,
  nlambda = 100, lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04),
  lambda = NULL, standardize = TRUE, intercept = TRUE,
  thresh = 1e-07, dfmax = nvars + 1, pmax = min(dfmax * 2 + 20,
  nvars), exclude, penalty.factor = rep(1, nvars), lower.limits = -Inf,
  upper.limits = Inf, maxit = 1e+05, type.gaussian = ifelse(nvars <
  500, "covariance", "naive"), type.logistic = c("Newton",
  "modified.Newton"), standardize.response = FALSE,
  type.multinomial = c("ungrouped", "grouped"), relax = FALSE,
  trace.it = 0, ...)

offset

According to the official R documentation, offset should be

A vector of length nobs that is included in the linear predictor (a nobs x nc matrix for the “multinomial” family).

Its default value is NULL: in that case, glmnet internally sets the offset to be a vector of zeros having the same length as the response y.

Here is some example code for using the offset option:

set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
offset <- rnorm(n)

# fit model
fit1 <- glmnet(x, y, offset = offset)

If we specify offset in the glmnet call, then when making predictions with the model, we must specify the newoffset option. For example, if we want the predictions fit1 gives us at \lambda = 0.1 for the training data, not specifying newoffset will give us an error:

This is the correct code:

predict(fit1, x, s = 0.1, newoffset = offset)
#                 1
#  [1,]  0.44691399
#  [2,]  0.30013292
#  [3,] -1.68825225
#  [4,] -0.49655504
#  [5,]  1.20180199
#  ...

So, what does offset actually do (or mean)? Recall that glmnet is fitting a linear model. More concretely, our data is \{ (x_1, y_1), \dots, (x_n, y_n) \}, where the x_j \in \mathbb{R}^p are our features for observation j and y_j \in \mathbb{R} is the response for observation j. For each observation, we are trying to model some variable z_j as a linear combination of the features, i.e. z_j = \beta_0 + \beta_1^T x_j. z_j is a function of z_j; the function depends on the context. For example,

  • For ordinary regression, z_j = y_j, i.e. the response itself.
  • For logistic regression, z_j = \text{logit}(y_j) = \log \left(\dfrac{y_j}{1-y_j} \right).
  • For Poisson regression, z_j = \log(y_j).

So, we are trying to find \beta_0 and \beta_1 so that \beta_0 + \beta_1^T x_j is a good estimate for z_j. If we have an offset (e_1, \dots, e_n), then we are trying to find \beta_0 and \beta_1 so that \boldsymbol{e_j} + \beta_0 + \beta_1^T x_j is a good estimate for z_j.

Why might we want to use offsets? There are two primary reasons for them stated in the documentation:

Useful for the “poisson” family (e.g. log of exposure time), or for refining a model by starting at a current fit.

Let me elaborate. First, offsets are useful for Poisson regression. The official vignette has a little section explaining this; let me explain it through an example.

Imagine that we are trying to predict how many points an NBA basketball player will score per minute based on his physical attributes. If the player’s physical attributes (i.e. the covariates of our model) are denoted by x and then the number of points he scores in a minute is denoted by y, then Poisson regression assumes that

\begin{aligned} y &\sim \text{Poisson}(\mu(x)), \\ \log [\mu(x)] &= \beta_0 + \beta_1^T x. \end{aligned}

\beta_0 and \beta_1 are parameters of the model to be determined.

Having described the model, let’s turn to our data. For each player 1, \dots, n, we have physical covariates x_1, \dots, x_n. However, instead of having each player’s points per minute, we have number of points scored over a certain time period. For example, we might have “player 1 scored 12 points over 30 minutes” instead of “player 1 scored 0.4 points per minute”.

Offsets allow us to use our data as is. In our example above, loosely speaking 12/30 (points per minute) is our estimate for \mu(x_1). Hence, 12 (points in 30 minutes) is our estimate for 30 \mu(x_1). In our model, \beta_0 + \beta_1^T x is our estimate for \log [\mu(x_1)], and so our estimate for \log [30\mu(x_1)] would be \beta_0 + \beta_1^T x + \log 30. The \log 30 term is the “offset” to get the model prediction for our data as is.

Taking this to the full dataset: if player j scores p_j points in t_j minutes, then our offset would be the vector (\log t_1, \dots, \log t_n), and the response we would feed glmnet is (p_1, \dots, p_n).

The second reason one might want to use offsets is to improve on an existing model. Continuing the example above: say we have a friend who has trained a model (not necessarily a linear model) to predict \log [\mu(x)], but he did not use the player’s physical attributes. We think that we can improve on his predictions by adding physical attributes to the model. One refinement to our friend’s model could be

\log[\mu(x)] = \hat{\theta} + \beta_0 + \beta_1^T x,

where \hat{\theta} is the prediction of \log[\mu(x)] from our friend’s model. In this setting, the offsets are simply our friend’s predictions. For model training, we would provide the first model’s predictions on the training observations as the offset. To get predictions from the refinement on new observations, we would first compute the predictions from the first model, then use them as the newoffset option in the predict call.

A deep dive into glmnet: standardize

I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, we will focus on the standardize option.

For reference, here is the full signature of the glmnet function (v3.0-2):

glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial",
  "cox", "mgaussian"), weights, offset = NULL, alpha = 1,
  nlambda = 100, lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04),
  lambda = NULL, standardize = TRUE, intercept = TRUE,
  thresh = 1e-07, dfmax = nvars + 1, pmax = min(dfmax * 2 + 20,
  nvars), exclude, penalty.factor = rep(1, nvars), lower.limits = -Inf,
  upper.limits = Inf, maxit = 1e+05, type.gaussian = ifelse(nvars <
  500, "covariance", "naive"), type.logistic = c("Newton",
  "modified.Newton"), standardize.response = FALSE,
  type.multinomial = c("ungrouped", "grouped"), relax = FALSE,
  trace.it = 0, ...)

Unless otherwise stated, n will denote the number of observations, p will denote the number of features, and fit will denote the output/result of the glmnet call. The data matrix is denoted by X \in \mathbb{R}^{n \times p} and the response is denoted by y \in \mathbb{R}^n.

standardize

When standardize = TRUE (default), columns of the data matrix x are standardized, i.e. each column of x has mean 0 and standard deviation 1. More specifically, we have that for each j = 1, \dots, p,

\displaystyle\sum_{i=1}^n X_{ij} = 0, and \sqrt{\displaystyle\sum_{i=1}^n \frac{X_{ij}^2}{n}} = 1.

Why might we want to do this? Standardizing our features before model fitting is common practice in statistical learning. This is because if our features are on vastly different scales, the features with larger scales will tend to dominate the action. (One instance where we might not want to standardize our features is if they are already all measured along the same scale, e.g. meters or kilograms.)

Notice that the standardization here is slightly different from that offered by the scale function: scale(x, center = TRUE, scale = TRUE) gives the standardization

\displaystyle\sum_{i=1}^n X_{ij} = 0, and \sqrt{\displaystyle\sum_{i=1}^n \frac{X_{ij}^2}{n-1}} = 1.

We verify this with a small data example. Generate data according to the following code:

n <- 100; p <- 5; true_p <- 2
set.seed(950)
X <- matrix(rnorm(n * p), nrow = n)
beta <- matrix(c(rep(1, true_p), rep(0, p - true_p)), ncol = 1)
y <- X %*% beta + 3 * rnorm(n)

Create a version of the data matrix which has standardized columns:

X_centered <- apply(X, 2, function(x) x – mean(x))
Xs <- apply(X_centered, 2, function(x) x / sqrt(sum(x^2) / n))

Next, we run glmnet on Xs and y with both possible options for standardize:

library(glmnet)
fit <- glmnet(Xs, y, standardize = TRUE)
fit2 <- glmnet(Xs, y, standardize = FALSE)

We can check that we get the same fit in both cases (modulo numerical precision):

sum(fit$lambda != fit2$lambda)
# 0
max(abs(fit$beta – fit2$beta))
# 6.661338e-16

The documentation notes that the coefficients returned are on the original scale. Let's confirm that with our small data set. Run glmnet with the original data matrix and standardize = TRUE:

fit3 <- glmnet(X, y, standardize = TRUE)

For each column j, our standardized variables are Z_j = \dfrac{X_j - \mu_j}{s_j}, where \mu_j and s_j are the mean and standard deviation of column j respectively. If \beta_j and \gamma_j represent the model coefficients of fit2 and fit3 respectively, then we should have

\begin{aligned} \beta_0 + \sum_{j=1}^p \beta_j Z_j &= \gamma_0 + \sum_{j=1}^p \gamma_j X_j, \\ \beta_0 + \sum_{j=1}^p \beta_j \frac{X_j - \mu_j}{s_j} &= \gamma_0 + \sum_{j=1}^p \gamma_j X_j, \\ \left( \beta_0 - \sum_{j=1}^p \frac{\mu_j \beta_j}{s_j} \right) + \sum_{j=1}^p \frac{\beta_j}{s_j} X_j &= \gamma_0 + \sum_{j=1}^p \gamma_j X_j, \end{aligned}

i.e. we should have \gamma_0 = \beta_0 - \sum_{j=1}^p \frac{\mu_j \beta_j}{s_j} and \gamma_j = \frac{\beta_j}{s_j} for j = 1, \dots, p. The code below checks that this is indeed the case (modulo numerical precision):

# get column means and SDs
X_mean <- colMeans(X)
X_sd <- apply(X_centered, 2, function(x) sqrt(sum(x^2) / n))

# check difference for intercepts
fit2_int <- coefficients(fit2)[1,]
fit3_int <- coefficients(fit3)[1,]
temp <- fit2_int - colSums(diag(X_mean / X_sd) %*% fit2$beta)
max(abs(temp - fit3_int))
# 1.110223e-16

# check difference for feature coefficients
temp <- diag(1 / X_sd) %*% fit2$beta
max(abs(temp - fit3$beta))
# 1.110223e-15

The discussion above has been for the standardization of x. What about standardization for y? The documentation notes that when family = "gaussian", y is automatically standardized, and the coefficients are unstandardized at the end of the procedure.

More concretely, let the mean and standard deviation of y be denoted by \mu_y and s_y respectively. If running glmnet on standardized y gives intercept \beta_0 and coefficients \beta_1, \dots, \beta_p, then glmnet on unstandardized y will give intercept \mu_y + s_y\beta_0 and coefficients s_y\beta_1, \dots, s_y\beta_p.

Again, this can be verified empirically:

# get mean and SD of y
y_mean <- mean(y)
y_sd <- sqrt(sum((y - y_mean)^2) / n)

# fit model with standardized y
fit4 <- glmnet(X, (y - y_mean) / y_sd, standardize = TRUE)

# check difference for intercepts
fit4_int <- coefficients(fit4)[1,]
temp <- fit4_int * y_sd + y_mean
max(abs(temp - fit3_int))
# 1.110223e-16

# check difference for feature coefficients
max(abs(y_sd * fit4$beta - fit3$beta))
# 8.881784e-16