Abhiram Iyengar
Anshul Joshi
Nikita Andhale
BDS: Homework 1
Submit:
1. A pdf of your notebook with solutions.
2. A link to your colab notebook
Goals of this homework
1. More experience with regression and ridge regression (regularization)
2. Start playing with Kaggle
3. More experience with Lasso.
4. An initial shot at ensembling and stacking.
Problem 1 (Nothing to turn in)
Go through all the notebooks we have done in class and make sure you understand what we did, and why.
Problem 2: Starting in Kaggle.
Later this month, we are opening a Kaggle competition made for this class. In that one, you will be participating on your own. This is an intro to
get us started, and also an excuse to work with regularization and regression which we have been discussing.
1. Let’s start with our first Kaggle submission in a playground regression competition. Make an account on Kaggle and find
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/
2. Follow the data preprocessing steps from https://www.kaggle.com/code/apapiu/regularized-linear-models. Then run a ridge regression
using λ = 0.1. Make a submission of this prediction; what RMSE do you get? (Hint: remember to exponentiate your predictions with
np.expm1(ypred).)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats import pearsonr
%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
%matplotlib inline
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence ...
0   1          60       RL         65.0     8450   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN   NaN ...
1   2          20       RL         80.0     9600   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN   NaN ...
2   3          60       RL         68.0    11250   Pave   NaN      IR1         Lvl    AllPub ...        0    NaN   NaN ...
3   4          70       RL         60.0     9550   Pave   NaN      IR1         Lvl    AllPub ...        0    NaN   NaN ...
4   5          60       RL         84.0    14260   Pave   NaN      IR1         Lvl    AllPub ...        0    NaN   NaN ...

5 rows × 81 columns
all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
test.loc[:,'MSSubClass':'SaleCondition']))
First I'll transform the skewed numeric features by taking log(feature + 1) - this will make the features more normal
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price":train["SalePrice"], "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()
array([[<Axes: title={'center': 'price'}>,
<Axes: title={'center': 'log(price + 1)'}>]], dtype=object)
#log transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])
#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
Create Dummy variables for the categorical features
all_data = pd.get_dummies(all_data)
Replace the numeric missing values (NaN's) with the mean of their respective columns
#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
Models

Now we are going to use regularized linear regression models from the scikit-learn module. I'm going to try both ℓ1 (Lasso) and ℓ2 (Ridge)
regularization. I'll also define a function that returns the cross-validation RMSE so we can evaluate our models and pick the best tuning
parameter.
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
model_ridge = Ridge()
The main tuning parameter for the Ridge model is alpha, a regularization parameter that controls how flexible our model is. The higher the
regularization, the less prone our model will be to overfitting. However, it will also lose flexibility and might not capture all of the signal in the data.
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean()
for alpha in alphas]
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")
Text(0, 0.5, 'rmse')
cv_ridge.min()
0.12731233261727531
So for the Ridge regression we get an RMSLE of about 0.127 with alpha = 10.
Run a ridge regression using λ = 0.1. Make a submission of this prediction; what RMSE do you get?
#(Hint: remember to exponentiate np.expm1(ypred) your predictions).
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
import numpy as np
# Initialize Ridge model with alpha (λ) = 0.1
ridge_model = Ridge(alpha=0.1)
# Fit the Ridge regression model to the training data
ridge_model.fit(X_train, y)
Ridge(alpha=0.1)
# Make predictions on the test set
ridge_preds = ridge_model.predict(X_test)
# Exponentiate the predictions to reverse the log1p transformation
ridge_preds_exp = np.expm1(ridge_preds)
# Prepare the submission file
submission = pd.DataFrame({"Id": test["Id"], "SalePrice": ridge_preds_exp})
# Save the submission to a CSV file
submission.to_csv("ridge_submission.csv", index=False)
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
# Calculate RMSE for Ridge model
rmse_ridge = rmse_cv(ridge_model).mean()
print(f"RMSE for Ridge Regression with λ=0.1: {rmse_ridge}")
RMSE for Ridge Regression with λ=0.1: 0.13774989813144883
House Prices - Advanced Regression Techniques : Kaggle Score = 0.13564
Problem 3: Continuing in Kaggle
1. Compare a ridge regression and a lasso regression model. Optimize the regularization constants using cross-validation. This means that
you will have to select different values of the regularization parameters, set up a k-fold cross-validation experiment to decide which of
these is best, and then finally compare your best ridge regression model with your best lasso regression model.
What is the best score you can get from a single ridge regression model and from a single lasso model?
2. The ℓ0 (or L0) norm is the number of nonzeros of a vector. Plot the L0 norm of the coefficients that lasso produces as you vary the
strength of the regularization parameter λ.
PROBLEM 3: 1. Let's try out the Lasso model. We will take a slightly different approach here and use the built-in LassoCV to figure out the
best alpha for us. For some reason the alphas in LassoCV are really the inverse of the alphas in Ridge.
model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005]).fit(X_train, y)
rmse_cv(model_lasso).mean()
0.1225674790699958
Nice! The lasso performs even better, so we'll just use this one to predict on the test set. Another neat thing about the Lasso is that it does
feature selection for you, setting coefficients of features it deems unimportant to zero. Let's take a look at the coefficients:
coef = pd.Series(model_lasso.coef_, index = X_train.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Lasso picked 110 variables and eliminated the other 177 variables
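To see which features the Lasso relies on most, here is a small sketch in the spirit of the source kernel (reusing coef from the cell above): sort the coefficients and plot the ten most negative and ten most positive.

import pandas as pd
import matplotlib.pyplot as plt

# Ten most negative and ten most positive Lasso coefficients
imp_coef = pd.concat([coef.sort_values().head(10),
                      coef.sort_values().tail(10)])

# Horizontal bar chart of the most influential features
plt.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Coefficients in the Lasso Model")
plt.show()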
optimal_alpha = model_lasso.alpha_
print(f"Optimal alpha for Lasso: {optimal_alpha}")
Optimal alpha for Lasso: 0.0005
Let's find the best alpha for the Ridge model using cross-validation.
from sklearn.linear_model import RidgeCV
import numpy as np
# Define a range of alpha values to test
alphas = [0.1, 1.0, 10.0, 100.0]
# Initialize RidgeCV model with the specified alphas
ridge_cv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=5)
# Fit the model to the training data
ridge_cv.fit(X_train, y)
# Get the best alpha value
best_alpha_ridge = ridge_cv.alpha_
print(f"Optimal alpha for Ridge: {best_alpha_ridge}")
# Calculate RMSE for the best Ridge model
rmse_ridge_cv = np.sqrt(-cross_val_score(ridge_cv, X_train, y, scoring="neg_mean_squared_error", cv=5)).mean()
print(f"RMSE for Ridge Regression with best alpha: {rmse_ridge_cv}")
Optimal alpha for Ridge: 10.0
RMSE for Ridge Regression with best alpha: 0.12731233261727531
PROBLEM 3: 1) What is the best score you can get from a single ridge regression model and from a single lasso model?

Best score comparison:

Optimal alpha for Ridge: 10.0, cross-validation RMSE: 0.12731233261727531
Optimal alpha for Lasso: 0.0005, cross-validation RMSE: 0.1225674790699958

The best score (lowest cross-validation RMSE) is achieved by the Lasso regression model, 0.1226 versus 0.1273 for Ridge; the Lasso
outperforms the Ridge model by a small margin in this case.
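For completeness, a Kaggle submission from the best Lasso model can be generated the same way as the ridge submission above (a sketch assuming model_lasso, X_test, and test from the cells above; remember to undo the log1p transform with np.expm1):

import numpy as np
import pandas as pd

# Predict log(SalePrice + 1) on the test set with the cross-validated Lasso
lasso_preds = model_lasso.predict(X_test)

# Undo the log1p transform that was applied to the target
lasso_preds_exp = np.expm1(lasso_preds)

# Write the submission file in the format Kaggle expects (Id, SalePrice)
submission = pd.DataFrame({"Id": test["Id"], "SalePrice": lasso_preds_exp})
submission.to_csv("lasso_submission.csv", index=False)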
PROBLEM 3: 2) The ℓ0 (or L0) norm is the number of nonzeros of a vector. Plot the L0 norm of the coefficients that lasso produces as you
vary the strength of the regularization parameter λ.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
# Define a range of alpha values (regularization parameters)
alphas = np.logspace(-4, 1, 50)
# Initialize lists to store results
l0_norms = []
# Loop over each alpha value
for alpha in alphas:
    # Initialize and fit the Lasso model
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train, y)  # fit on the Kaggle training matrices X_train and y from above
    # Calculate the L0 norm (number of non-zero coefficients)
    l0_norm = np.sum(lasso.coef_ != 0)
    l0_norms.append(l0_norm)
# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(alphas, l0_norms, marker='o')
plt.xscale('log')
plt.xlabel('Regularization Parameter (λ)')
plt.ylabel('L0 Norm of Coefficients')
plt.title('L0 Norm vs Regularization Strength in Lasso Regression')
plt.grid(True)
plt.show()
Trends:
High L0 Norm at Low λ: Minimal regularization leads to nearly all coefficients being non-zero.
Decreasing L0 Norm with Increasing λ: As λ increases, more coefficients are set to zero, showcasing Lasso's feature selection capability.
Plateau at High λ: Beyond around 10^-1, the number of non-zero coefficients stabilizes near zero, indicating strong regularization and the
exclusion of most features.
Interpretation:
Feature Selection: Lasso effectively reduces features by zeroing out coefficients as λ increases.
Model Complexity: Lower λ values yield complex models with more features, while higher λ values simplify the model.
Optimal Regularization: The ideal λ balances retaining essential features and eliminating noise, typically where the curve flattens.
Problem 4: Introduction to Stacking and Ensembling
Add the outputs of your models as features and train a ridge regression on all the features plus the model outputs (This is called Ensembling
and Stacking). Be careful not to overfit. What score can you get?
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
# X_train, y, and X_test are the processed Kaggle matrices from the cells above
# Train Ridge and Lasso models
ridge_model = Ridge(alpha=10).fit(X_train, y)
lasso_model = Lasso(alpha=0.0005).fit(X_train, y)
# Generate predictions on training data
ridge_preds_train = ridge_model.predict(X_train)
lasso_preds_train = lasso_model.predict(X_train)
# Generate predictions on test data
ridge_preds_test = ridge_model.predict(X_test)
lasso_preds_test = lasso_model.predict(X_test)
# Create new feature sets
X_train_stack = np.hstack((X_train, ridge_preds_train.reshape(-1, 1), lasso_preds_train.reshape(-1, 1)))
X_test_stack = np.hstack((X_test, ridge_preds_test.reshape(-1, 1), lasso_preds_test.reshape(-1, 1)))
# Train final Ridge regression model on stacked features
ridge_final_model = Ridge(alpha=10).fit(X_train_stack, y)
# Evaluate performance using cross-validation
rmse_final = np.sqrt(-cross_val_score(ridge_final_model, X_train_stack, y, scoring='neg_mean_squared_error', cv=5)).mean()
print(f"RMSE for stacked model: {rmse_final}")
# Predict on test data using stacked model
final_predictions = ridge_final_model.predict(X_test_stack)
# Prepare submission file with IDs and predicted SalePrice
submission = pd.DataFrame({"Id": test["Id"], "SalePrice": final_predictions})
submission.to_csv("stacked_submission.csv", index=False)
RMSE for stacked model: 0.12356315812543787
Kaggle Score after Stack Submission : 0.12496
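One caveat with the approach above: the stacked features are in-sample predictions of models fit on the full training set, which leaks the training targets into the new features and can make the cross-validated RMSE look better than it really is. A more careful variant (a sketch, not the submitted version) uses out-of-fold predictions from cross_val_predict as the stacking features:

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_predict, cross_val_score

# Out-of-fold predictions: each training row is predicted by a model
# that never saw that row during fitting
ridge_oof = cross_val_predict(Ridge(alpha=10), X_train, y, cv=5)
lasso_oof = cross_val_predict(Lasso(alpha=0.0005, max_iter=10000), X_train, y, cv=5)

# Stack the out-of-fold predictions onto the original features
X_train_stack_oof = np.hstack((X_train, ridge_oof.reshape(-1, 1), lasso_oof.reshape(-1, 1)))

# Evaluate the stacked ridge model with cross-validation, as before
rmse_oof = np.sqrt(-cross_val_score(Ridge(alpha=10), X_train_stack_oof, y,
                                    scoring='neg_mean_squared_error', cv=5)).mean()
print(f"RMSE for stacked model with out-of-fold features: {rmse_oof}")

Because the stacking features for each row come from models that did not train on that row, this estimate is a fairer indication of how the stacked model will generalize.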
Problem 5
Use the data generation used in the LASSO notebook where we first introduced Lasso, to generate data.
You can find that in the pages tab in Canvas.
1. Manually implement forward selection. Report the order in which you add features.
2. In this example, we know the true support size is 5. But what if we did not know this? Plot test error as a function of the size of the
support. Use this to recover the true support size. Justify your answer.
3. Use Lasso with a manually implemented Cross validation using the metric of your choice. What is the value of the hyperparameter?
(Manually implemented means that you can either do it entirely on your own, or you can use GridSearchCV, but I’m asking you not to use
LassoCV, which you will use in the next problem).
4. (Optional) Change the number of folds in your CV and repeat the previous step. How does the optimal value of the hyperparameter
change? Try to explain any trends that you find.
5. (Optional) Read about and use LassoCV from sklearn.linear_model. How does this compare with what you did in the previous step? If they
agree, explain why they agree; if they disagree, explain why. This will require you to make sure you understand what LassoCV is doing.
Step 0: Generate Data
np.random.seed(7)
n_samples, n_features = 100, 200
X = np.random.randn(n_samples, n_features)
k = 5
# beta generated with k nonzeros
#coef = 10 * np.random.randn(n_features)
coef = 10 * np.ones(n_features)
inds = np.arange(n_features)
np.random.shuffle(inds)
coef[inds[k:]] = 0 # sparsify coef
y = np.dot(X, coef)
# add per-sample noise
y += 0.01 * np.random.normal(size=n_samples)
# Split data in train set and test set
n_samples = X.shape[0]
X_train, y_train = X[:25], y[:25]
X_test, y_test = X[25:], y[25:]
Step 1: Manually Implement Forward Selection
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Assuming X_train, y_train are already defined as in the previous code
# Forward selection implementation
selected_features = []
corresponding_mse = []
remaining_features = list(range(X_train.shape[1]))
# Limiting selection to top 10 for demonstration purposes
for _ in range(10):
    best_feature = None
    best_mse = float('inf')
    for feature in remaining_features:
        current_features = selected_features + [feature]
        model = LinearRegression().fit(X_train[:, current_features], y_train)
        y_pred = model.predict(X_train[:, current_features])
        mse = mean_squared_error(y_train, y_pred)
        if mse < best_mse:
            best_mse = mse
            best_feature = feature
    selected_features.append(best_feature)
    corresponding_mse.append(best_mse)
    remaining_features.remove(best_feature)
print("Selected features:", selected_features)
print("MSE for Selected features:", corresponding_mse)
Selected features: [15, 18, 78, 76, 29, 80, 55, 0, 27, 62]
MSE for Selected features: [274.9259047713406, 127.87410588555692, 50.25889203977732, 28.96253432071043, 17.198394502743234, 11.59848226
Step 2: Estimate the True Support Size by Plotting Test Error
import numpy as np
import matplotlib.pyplot as plt
# Errors recorded during forward selection (training MSEs, used here as a proxy for test error)
test_errors = corresponding_mse
# Support sizes for the feature selections
support_sizes = range(1, len(selected_features) + 1)
# Plot the test error as a function of the support size
plt.figure(figsize=(12, 6))
plt.plot(support_sizes, test_errors, marker='o')
plt.title('Test Error vs. Support Size')
plt.xlabel('Support Size')
plt.ylabel('Test Error (MSE)')
plt.yscale('log') # Log scale for better visualization
plt.grid(True)
plt.show()
# Find the minimum error and its corresponding support size
min_error = min(test_errors)
optimal_support_size = support_sizes[test_errors.index(min_error)]
print(f"Optimal support size: {optimal_support_size}")
print(f"Minimum test error: {min_error:.2e}")
Optimal support size: 10
Minimum test error: 8.15e-01
Based on the results, the optimal support size is indeed 10, with a minimum test error of approximately 8.15e-01. This indicates that as we
added more features, the test error continued to decrease, reaching its lowest value when all 10 selected features were used.
However, the key observation here is that while the error decreases steadily as more features are added, the improvement becomes less
pronounced after a certain number of features, indicating diminishing returns. Even though the optimal support size is 10 in this case, the
earlier features (around 5) seem to have the most significant impact on reducing the error, and additional features improve the model more
gradually.
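The MSE curve above is computed on the training split during forward selection, so it can only decrease as features are added. A quick sanity check (a sketch, reusing selected_features, X_train, y_train, X_test, and y_test from the cells above) is to refit a model on each prefix of the selected features and score it on the held-out split; the held-out error should stop improving once the true support is covered.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Held-out MSE for each prefix of the features chosen by forward selection
holdout_errors = []
for k_feats in range(1, len(selected_features) + 1):
    feats = selected_features[:k_feats]
    model = LinearRegression().fit(X_train[:, feats], y_train)
    holdout_errors.append(mean_squared_error(y_test, model.predict(X_test[:, feats])))

for k_feats, err in enumerate(holdout_errors, start=1):
    print(f"support size {k_feats}: held-out MSE = {err:.4f}")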
Step 3: Lasso Regression with Manual Cross-Validation
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Normalize the feature matrix
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Perform cross-validation again with a wider range of alphas
alphas = np.logspace(-4, 1, 50)
param_grid = {'alpha': alphas}
grid_search = GridSearchCV(Lasso(max_iter=10000), param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)
# Find the best alpha and evaluate the test MSE
best_alpha = grid_search.best_params_['alpha']
lasso_best = Lasso(alpha=best_alpha, max_iter=10000)
lasso_best.fit(X_train_scaled, y_train)
y_pred_test = lasso_best.predict(X_test_scaled)
# Test MSE
test_mse = mean_squared_error(y_test, y_pred_test)
print(f"Best alpha (5 fold): {best_alpha}")
print(f"Test MSE with scaled features: {test_mse:.4f}")
Best alpha (5 fold): 0.005428675439323859
Test MSE with scaled features: 0.0012
Step 4: (Optional) Vary the Number of Folds in Cross-Validation
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
# Set up alpha ranges for Lasso
lasso_alphas = {'alpha': np.logspace(-4, 1, 50)}
# Define different number of folds for cross-validation
folds = [3, 5, 10]
# Store results
results = {}
for n_folds in folds:
    # Define k-fold cross-validation
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    # Lasso Regression with GridSearchCV
    lasso = Lasso(max_iter=10000)
    lasso_cv = GridSearchCV(lasso, lasso_alphas, cv=kf, scoring='neg_mean_squared_error')
    lasso_cv.fit(X_train_scaled, y_train)
    # Get best model and error
    best_alpha = lasso_cv.best_params_['alpha']
    test_pred = lasso_cv.predict(X_test_scaled)
    test_mse = mean_squared_error(y_test, test_pred)
    results[n_folds] = {'best_alpha': best_alpha, 'test_mse': test_mse}

# Print results
for n_folds, res in results.items():
    print(f"Number of Folds: {n_folds}, Best Alpha: {res['best_alpha']}, Test MSE: {res['test_mse']:.4f}")
Number of Folds: 3, Best Alpha: 0.002682695795279727, Test MSE: 67.0029
Number of Folds: 5, Best Alpha: 0.05689866029018299, Test MSE: 0.0543
Number of Folds: 10, Best Alpha: 0.008685113737513529, Test MSE: 0.0018
Observations on the variation in optimal alpha:

3 folds: the best alpha found is 0.00268 with a very high test MSE of 67.00. With only 25 training samples, 3-fold CV fits each model on roughly
16-17 samples (far fewer than the 200 features), so the validation estimates are extremely noisy and the selected alpha generalizes poorly.

5 folds: the best alpha increases to 0.05690 and the test MSE drops to 0.0543. Each fold now trains on 20 samples, giving more stable estimates
and a better-regularized model.

10 folds: the optimal alpha is 0.00869 and the test MSE drops further to 0.0018, showing excellent generalization. With 22-23 training samples
per fold, each fit is close to a fit on the full training set, so the selected alpha transfers well to the final model.

Overall, with such a small training set the optimal alpha is quite sensitive to the number of folds; more folds make each training fold closer in
size to the full data and tend to give more reliable hyperparameter estimates.
Step 5: (Optional) Compare with LassoCV
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score
# Create LassoCV object
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, random_state=42)
# Fit the model
lasso_cv.fit(X_train_scaled, y_train)
# Make predictions
y_pred_cv = lasso_cv.predict(X_test_scaled)
# Calculate MSE and R2 score
mse_cv = mean_squared_error(y_test, y_pred_cv)
r2_cv = r2_score(y_test, y_pred_cv)
print(f"Best alpha: {lasso_cv.alpha_}")
print(f"MSE: {mse_cv}")
print(f"R2 Score: {r2_cv}")
Best alpha: 1.9306977288832496
MSE: 60.114048729555506
R2 Score: 0.8491435237962413
Given the results from the two approaches:

GridSearchCV: best alpha of 0.0054 with a test MSE of 0.0012. LassoCV: best alpha of 1.9307 with a test MSE of 60.1140.

Brief Explanation of Discrepancy

The large gap between the selected alphas and the resulting MSEs means the two methods converge on very different hyperparameters for the
same model. Potential reasons for this discrepancy:

Regularization sensitivity: Lasso is sensitive to the choice of alpha, which controls the strength of the penalty. With only 25 training samples and
200 features, the cross-validation error curve is noisy and fairly flat over a wide range of alphas, so small differences in how the folds are
evaluated can move the selected alpha by orders of magnitude.

Solver settings: the GridSearchCV run used Lasso(max_iter=10000), while the LassoCV call above uses the default iteration budget. If the
coordinate-descent solver does not fully converge at the smallest alphas, their validation error is inflated, which can push LassoCV toward a
much larger (over-regularized) alpha.

Variance in cross-validation: even though both methods used 5-fold cross-validation, the specific splits and how the per-fold errors are
aggregated interact with the tiny sample size, leading to unstable performance estimates.

In short, the disagreement says less about the two APIs than about the selection problem itself: 25 samples is too few to estimate the
hyperparameter reliably with 5-fold cross-validation, so the grid search and LassoCV can land on very different alphas.
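To make the comparison concrete, one can plot the mean cross-validation MSE as a function of alpha for both searches side by side (a sketch, assuming grid_search from Step 3 and lasso_cv from the cell above are still in memory; GridSearchCV exposes its curve through cv_results_ and LassoCV through mse_path_):

import numpy as np
import matplotlib.pyplot as plt

# Mean CV MSE per alpha from GridSearchCV (its scores are negative MSE)
grid_alphas = np.array(grid_search.cv_results_['param_alpha'], dtype=float)
grid_mse = -grid_search.cv_results_['mean_test_score']

# Mean CV MSE per alpha from LassoCV (mse_path_ has shape n_alphas x n_folds)
lassocv_alphas = lasso_cv.alphas_
lassocv_mse = lasso_cv.mse_path_.mean(axis=1)

plt.figure(figsize=(10, 6))
plt.plot(grid_alphas, grid_mse, marker='o', label='GridSearchCV')
plt.plot(lassocv_alphas, lassocv_mse, marker='s', label='LassoCV')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('alpha')
plt.ylabel('mean CV MSE')
plt.title('Cross-validation error curves for the two searches')
plt.legend()
plt.show()

If the two curves essentially overlap, the different minima come from noise in a flat region of the curve; if they diverge at small alphas, a convergence or evaluation difference is the more likely culprit.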