MLT Lab Prep Guide


MLT Lab 1 Prep Guide: EDA and Feature Engineering

This lab focuses on the fundamentals of data handling with Pandas, performing Exploratory
Data Analysis (EDA), and preparing categorical data for machine learning models.

Q1: Pandas I/O and File Formats

This question tests your ability to read and write data in different formats using the Pandas
library.

Key Concepts & Definitions

●​ DataFrame: The primary data structure in Pandas, a 2D labeled array similar to a spreadsheet.
●​ CSV (Comma-Separated Values): A simple text file where values are separated by
commas. It's lightweight and human-readable but doesn't store data types or metadata.
●​ Excel (.xlsx): A binary file format used by Microsoft Excel. It can store data types,
formulas, charts, and multiple sheets, making it richer than CSV but less portable.
●​ JSON (JavaScript Object Notation): A text-based format that stores data in key-value
pairs, similar to Python dictionaries. It preserves data structure and is widely used in web
applications and APIs.

import pandas as pd

# a) Read a CSV file


df = pd.read_csv('titanic.csv')
print("Original DataFrame shape:", df.shape)

# b) Write to Excel and JSON


df.to_excel('titanic.xlsx', index=False) # index=False avoids writing row numbers to the file
df.to_json('titanic.json', orient='records', lines=True) # common format for JSON

# c) Read back from all formats to verify


df_csv = pd.read_csv('titanic.csv')
df_excel = pd.read_excel('titanic.xlsx')
df_json = pd.read_json('titanic.json', orient='records', lines=True)
print("CSV shape:", df_csv.shape)
print("Excel shape:", df_excel.shape)
print("JSON shape:", df_json.shape)

# d) Append new rows


new_passengers = pd.DataFrame([
{'PassengerId': 1000, 'Survived': 1, 'Pclass': 1, 'Name': 'Doe, John', 'Sex': 'male', 'Age': 35},
{'PassengerId': 1001, 'Survived': 0, 'Pclass': 3, 'Name': 'Doe, Jane', 'Sex': 'female', 'Age': 28}
])

# Append to CSV
new_passengers.to_csv('titanic.csv', mode='a', header=False, index=False)

# Append to Excel (requires reading first)


with pd.ExcelWriter('titanic.xlsx', mode='a', engine='openpyxl', if_sheet_exists='overlay') as writer:
    new_passengers.to_excel(writer, header=False, index=False,
                            startrow=writer.sheets['Sheet1'].max_row)

# Append to JSON (read, extend, and rewrite)


import json

with open('titanic.json', 'r+') as f:
    data = [json.loads(line) for line in f]
    data.extend(new_passengers.to_dict('records'))
    f.seek(0)  # Go to the beginning of the file
    for entry in data:
        f.write(json.dumps(entry) + '\n')

Viva Voce Prep

●​ Q: What are the main differences between CSV, Excel, and JSON?
○​ A: CSV is plain text, lightweight, and universally supported but lacks data type
support. Excel is a binary format that supports data types, formulas, and multiple
sheets but is proprietary. JSON is text-based, human-readable, and preserves
data structures (like nested objects), making it ideal for web APIs.
●​ Q: Why is appending to a JSON file more complex than a CSV file?
○​ A: CSV is a line-by-line format, so you can simply add new lines at the end. A
standard JSON file is a single structured object (like a list of dictionaries). To
append, you must parse the entire object into memory, add the new data, and
then write the entire modified object back to the file.
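Because titanic.json above was written in line-delimited form (orient='records', lines=True), newer pandas versions can also append records directly; a minimal sketch, assuming pandas 2.0 or later:

# Assumption: pandas >= 2.0, where DataFrame.to_json accepts mode='a' for line-delimited JSON
new_passengers.to_json('titanic.json', orient='records', lines=True, mode='a')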

Q2: Descriptive Statistics


This task is about summarizing the main characteristics of your dataset using statistical
measures.

Key Concepts & Definitions

●​ Central Tendency: Measures that describe the center of a dataset.


○​ Mean: The average value. Sensitive to outliers.
○​ Median: The middle value when sorted. Robust to outliers.
●​ Variability (or Dispersion): Measures that describe how spread out the data is.
○​ Standard Deviation: The average distance of data points from the mean. A
higher value means more spread.
○​ Min/Max: The minimum and maximum values, defining the range of the data.
●​ Frequency Count: For categorical data, this is the count of occurrences for each unique
category.

import pandas as pd
df = pd.read_csv('titanic.csv')

# a) Summary statistics for numerical columns


numerical_summary = df.describe()
print("Numerical Summary:\n", numerical_summary)

# c) Frequency counts for categorical columns


sex_counts = df['Sex'].value_counts()
embarked_counts = df['Embarked'].value_counts()
print("\nSex Distribution:\n", sex_counts)
print("\nEmbarked Distribution:\n", embarked_counts)

Viva Voce Prep

●​ Q: In the 'Age' column, if the mean is 30 and the median is 28, what does that
suggest?
○​ A: It suggests the distribution of age is slightly right-skewed. The mean is being
pulled higher by a few older passengers (outliers).
●​ Q: What does a high standard deviation in the 'Fare' column indicate?
○​ A: It indicates that the fares are widely spread out. There is a large variability in
how much passengers paid, likely due to the different classes (1st, 2nd, 3rd).

Q3: Exploratory Data Visualizations 📊

Visualizations help uncover patterns, anomalies, and relationships in the data that are not
obvious from summary statistics alone.

Key Concepts & Definitions

●​ Histogram: Visualizes the distribution of a single numerical variable by grouping
numbers into ranges ("bins") and showing the frequency of each bin.
●​ Pie Chart: Shows the composition of a categorical variable as slices of a whole. Best for
a small number of categories.
●​ Box Plot: Displays the five-number summary of a numerical variable (minimum, first
quartile, median, third quartile, and maximum). Excellent for spotting outliers.
●​ Correlation Heatmap: A graphical representation of the correlation matrix, showing the
pairwise correlation between numerical variables. Colors indicate the strength and
direction of the correlation.

import matplotlib.pyplot as plt

import seaborn as sns

import pandas as pd

df = pd.read_csv('titanic.csv')

# a) Histogram for Age

sns.histplot(df['Age'].dropna(), kde=True)

plt.title('Distribution of Passenger Age')

plt.xlabel('Age')

plt.ylabel('Frequency')

plt.show()
# b) Pie chart for Sex

sex_counts = df['Sex'].value_counts()

plt.pie(sex_counts, labels=sex_counts.index, autopct='%1.1f%%', startangle=90)

plt.title('Gender Distribution of Passengers')

plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

# c) Box plot for Fare

sns.boxplot(x=df['Fare'])

plt.title('Box Plot of Ticket Fare')

plt.xlabel('Fare')

plt.show()

# d) Correlation heatmap

# Select only numeric columns for correlation

numeric_cols = df.select_dtypes(include=['float64', 'int64'])

correlation_matrix = numeric_cols.corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

plt.title('Correlation Heatmap of Numerical Features')

plt.show()

Viva Voce Prep

●​ Q: How do you interpret a box plot?


○​ A: The box represents the interquartile range (IQR), containing the middle 50%
of the data. The line inside is the median. The "whiskers" extend to show the
range of the data, and points beyond the whiskers are potential outliers.
●​ Q: In the heatmap, what does a correlation of -0.5 between 'Pclass' and 'Fare'
mean?
○​ A: It indicates a moderate negative correlation. As the passenger class
number goes up (from 1st to 3rd), the ticket fare tends to go down, which makes
logical sense.

Q4: Feature Encoding

Machine learning models require numerical input, so categorical text data must be converted
into a numerical format. This process is called feature encoding.

Key Concepts & Definitions

●​ Label Encoding: Assigns a unique integer to each category (e.g., 'C' -> 0, 'Q' -> 1, 'S' -> 2;
scikit-learn's LabelEncoder assigns integers in sorted order). This implies an ordinal
relationship (0 < 1 < 2), which is often undesirable.
●​ One-Hot Encoding (OHE): Creates a new binary (0 or 1) column for each category. For
a feature with k categories, it creates k new columns. This avoids the ordinality issue but
can create many columns.
●​ Dummy Encoding: A variation of OHE where it creates k-1 new columns. The k-th
category is implicitly represented when all other dummy columns are 0. This is done to
avoid multicollinearity, which can be a problem for linear models.

import pandas as pd

from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q', 'C']})

print("Original DataFrame:\n", df)


# a) Label Encoding

le = LabelEncoder()

df['Embarked_LabelEncoded'] = le.fit_transform(df['Embarked'])

print("\nAfter Label Encoding:\n", df)

# b) One-Hot Encoding

ohe_df = pd.get_dummies(df['Embarked'], prefix='Embarked')

print("\nAfter One-Hot Encoding:\n", ohe_df)

# c) Dummy Encoding

dummy_df = pd.get_dummies(df['Embarked'], prefix='Embarked', drop_first=True)

print("\nAfter Dummy Encoding (drop_first=True):\n", dummy_df)

Viva Voce Prep

●​ Q: When should you use Label Encoding vs. One-Hot Encoding?


○​ A: Use Label Encoding only when the categories have a natural order (e.g.,
'low', 'medium', 'high'). Use One-Hot Encoding for nominal categories where no
such order exists (e.g., 'Embarked' locations) to prevent the model from
assuming a false order.
●​ Q: What is multicollinearity and why is Dummy Encoding used to prevent it?
○​ A: Multicollinearity is when independent variables in a regression model are
highly correlated. With full One-Hot Encoding, the sum of the new columns is
always 1, creating perfect correlation. By dropping one column (Dummy
Encoding), this perfect relationship is broken, which is crucial for linear
regression models to function correctly.
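A quick numerical check of the "columns always sum to 1" point above, reusing the toy 'Embarked' frame from the code section (a minimal sketch):

ohe_full = pd.get_dummies(df['Embarked'], dtype=int)
print(ohe_full.sum(axis=1))  # every row sums to 1, so the full set of dummy columns is linearly dependent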
MLT Lab 2 Prep Guide: LINEAR REGRESSION

Key Concepts & Definitions

●​ Linear Regression: A statistical method for modeling the relationship between a
dependent variable (target) and one or more independent variables (features).
●​ Simple Linear Regression: One feature. Equation: y=β_0+β_1x.
●​ Multiple Linear Regression: More than one feature. Equation:
y=β_0+β_1x_1+...+β_nx_n.
●​ Ordinary Least Squares (OLS): A method to find the best-fit line by minimizing the sum
of the squared differences between the predicted values and actual values (residuals).
○​ Normal Equation (for Multiple Regression): A closed-form analytical solution
for OLS: β = (XᵀX)⁻¹ Xᵀ y.
●​ Gradient Descent (GD): An iterative optimization algorithm used to find the minimum of
a cost function (like Mean Squared Error). It repeatedly adjusts the model's parameters
in the direction opposite to the gradient of the cost function.
○​ Learning Rate (α): A hyperparameter that controls how big of a step the
algorithm takes during each iteration.

Q1. Implementing Simple Linear Regression from Scratch (OLS & GD)
a) Load the dataset and choose “area” as the feature and “price” as the target.
b) Write a function to implement Simple Linear Regression using Ordinary Least Squares (OLS)
(derive slope & intercept).
c) Write another function for Simple Linear Regression using Gradient Descent (GD) with a
learning rate and fixed iterations.

# Q1 (a,b): Simple Linear Regression via OLS

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Housing.csv')  # same Housing dataset used in Q3 below

x = df["area"].values
y = df["price"].values
x_mean, y_mean = x.mean(), y.mean()
b1_ols = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b0_ols = y_mean - b1_ols * x_mean                                         # intercept
y_pred_ols = b0_ols + b1_ols * x                                          # fitted line
print("Q1(b) OLS:")
print(f"OLS → slope: {float(b1_ols):.4f}, intercept: {float(b0_ols):.4f}")

plt.scatter(x, y, s=10)
plt.plot(np.sort(x), b0_ols + b1_ols * np.sort(x))
plt.show()

# Q1 (c): Simple Linear Regression via Gradient Descent
# (raw, unscaled features, so the learning rate must be very small to avoid divergence)

def simple_linear_regression_gd(x, y, lr=1e-8, iters=1000):
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(iters):
        y_pred = m * x + c
        dm = (-2 / n) * np.sum(x * (y - y_pred))  # gradient w.r.t. slope
        dc = (-2 / n) * np.sum(y - y_pred)        # gradient w.r.t. intercept
        m -= lr * dm
        c -= lr * dc
    return m, c

m_gd, c_gd = simple_linear_regression_gd(x, y)
print(f"GD → slope: {float(m_gd):.4f}, intercept: {float(c_gd):.4f}")

plt.scatter(x, y, s=10)
plt.plot(np.sort(x), c_gd + m_gd * np.sort(x))
plt.show()

Q2. Multiple Linear Regression from Scratch (OLS & GD)


a) Use all numerical features (area, bedrooms, bathrooms, stories, parking) to predict price.
b) Implement Multiple Linear Regression using OLS (Normal Equation).
c) Implement Multiple Linear Regression using GD.

# Q2 (b): Multiple Linear Regression via OLS (Normal Equation)

from sklearn.metrics import mean_squared_error, r2_score

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

X = df[["area", "bedrooms", "bathrooms", "stories", "parking"]].values
Xb = np.c_[np.ones((len(X), 1)), X]          # add a column of ones for the intercept
beta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y  # Normal Equation: beta = (X^T X)^{-1} X^T y
intercept_q2_ols = beta[0]
coefs_q2_ols = beta[1:]
y_pred_q2_ols = Xb @ beta                    # predicted prices
print("Q2(b) OLS:", intercept_q2_ols, coefs_q2_ols,
      r2_score(y, y_pred_q2_ols), rmse(y, y_pred_q2_ols))

# Q2 (c): Multiple Linear Regression via Gradient Descent (standardized features)

def multiple_lr_gd_stable(X, y, lr=0.01, n_iter=20000):
    # Standardize features and target so GD converges with a moderate learning rate
    X_mean, X_std = X.mean(axis=0), X.std(axis=0)
    y_mean, y_std = y.mean(), y.std()
    Xs = (X - X_mean) / X_std
    ys = (y - y_mean) / y_std
    n, d = Xs.shape
    Xb = np.c_[np.ones((n, 1)), Xs]
    beta = np.zeros(d + 1)
    for _ in range(n_iter):
        yhat = Xb @ beta
        grad = (2 / n) * Xb.T @ (yhat - ys)
        beta -= lr * grad
    # Convert the coefficients back to the original (unstandardized) scale
    coefs = (y_std / X_std) * beta[1:]
    intercept = y_mean + y_std * (beta[0] - np.sum((X_mean / X_std) * beta[1:]))
    return intercept, coefs

intercept_q2_gd, coefs_q2_gd = multiple_lr_gd_stable(X, y)
y_pred_q2_gd = intercept_q2_gd + X @ coefs_q2_gd
print("Q2(c) GD:", intercept_q2_gd, coefs_q2_gd,
      r2_score(y, y_pred_q2_gd), rmse(y, y_pred_q2_gd))

Viva Voce Prep

●​ Q: What is the main difference between OLS (Normal Equation) and Gradient
Descent?
○​ A: The Normal Equation is an analytical, direct solution that computes the
optimal parameters in one step. It's fast for small datasets but becomes very slow
if the number of features is large (over 10,000) because it requires inverting a
large matrix. Gradient Descent is an iterative algorithm that gradually converges
to the optimal solution. It works well with very large datasets and doesn't require
matrix inversion, but you need to choose a learning rate and number of iterations.
●​ Q: Why is feature scaling important for Gradient Descent but not for the Normal
Equation?
○​ A: In GD, features with different scales can cause the cost function to become a
very elongated oval. This makes the algorithm take a long, inefficient path to the
minimum. Scaling features (e.g., to a range of 0-1 or with a standard deviation of
1) makes the cost function more circular, allowing GD to converge much faster
and more reliably. The Normal Equation is a direct calculation and is unaffected
by feature scales.
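A minimal sketch of the scaling step described above, using scikit-learn's StandardScaler (X is assumed to be the feature matrix from Q2):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(3), X_scaled.std(axis=0).round(3))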
Q3: Linear Regression using Scikit-Learn

Scikit-Learn provides an efficient, high-level implementation for building models.

from sklearn.linear_model import LinearRegression


from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv('Housing.csv')
features = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']
X = df[features]
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# a) Simple Linear Regression (area -> price)


X_train_simple = X_train[['area']]
X_test_simple = X_test[['area']]

slr_model = LinearRegression()
slr_model.fit(X_train_simple, y_train)
print(f"Simple LR - Intercept: {slr_model.intercept_}, Coefficient: {slr_model.coef_}")

# b) Multiple Linear Regression


mlr_model = LinearRegression()
mlr_model.fit(X_train, y_train)
print(f"\nMultiple LR - Intercept: {mlr_model.intercept_}")
print(f"Multiple LR - Coefficients: {dict(zip(features, mlr_model.coef_))}")

Viva Voce Prep

●​ Q: What are the main steps to build a model in scikit-learn?


○​ A: 1. Import the model class (e.g., from sklearn.linear_model import
LinearRegression). 2. Instantiate the model (model =
LinearRegression()). 3. Fit the model to the training data
(model.fit(X_train, y_train)). 4. Predict on new data
(model.predict(X_test)).
●​ Q: How do you interpret the coefficients from the multiple regression model?
○​ A: Each coefficient represents the average change in the target variable (price)
for a one-unit increase in that feature, holding all other features constant. For
example, if the coefficient for 'bedrooms' is 100,000, it means adding one
bedroom is associated with an average price increase of 100,000, assuming
area, bathrooms, etc., do not change.

ML Lab 3 Prep Guide: Advanced Regression Topics


This lab dives deeper into model validation by checking assumptions and introduces a
technique to improve model performance called regularization.

Q1 & Q2: Linear Regression Assumptions & VIF

A model is only reliable if its underlying assumptions are met.

Key Concepts & Definitions

1.​ Linearity: The relationship between features and the target is linear. Check with a
residuals vs. predicted values plot (should be random, no pattern).
2.​ Independence of Errors: The errors (residuals) are independent of each other.
Important for time-series data.
3.​ Homoscedasticity: The variance of the errors is constant across all levels of the
independent variables. Check the residuals vs. predicted values plot (should be a
constant band, not a cone shape).
4.​ Normality of Errors: The errors follow a normal distribution. Check with a Q-Q Plot or a
histogram of residuals.
5.​ No Multicollinearity: The independent variables are not highly correlated with each
other.
○​ Variance Inflation Factor (VIF): A metric to quantify multicollinearity. A common
rule of thumb is that a VIF > 5 or 10 indicates a problematic level of correlation.

#checking assumptions
#1.linearity
import matplotlib.pyplot as plt
plt.scatter(y_test,y_pred)
plt.xlabel("actual price")
plt.ylabel("predicted price")
plt.show()

import seaborn as sns


#2.normality of residuals
residuals=y_test-y_pred
sns.histplot(residuals,kde=True)
plt.show()

#3.homoscedasticity
plt.scatter(y_pred,residuals)
plt.xlabel("predicted")
plt.ylabel("residuals")
plt.show()

#4.independence of errors (Durbin-Watson test)


import statsmodels.api as sm
from scipy import stats
x_train_num = x_train.astype(float)
x_const = sm.add_constant(x_train_num)
ols_model = sm.OLS(y_train,x_const).fit()
print("Durbin-Watson:", sm.stats.stattools.durbin_watson(ols_model.resid))

Viva Voce Prep

●​ Q: What is homoscedasticity, and what does it look like if it's violated
(heteroscedasticity)?
○​ A: Homoscedasticity means the errors have constant variance. In a residual plot,
this looks like a random cloud of points in a constant band around the zero line.
Heteroscedasticity is when the variance is not constant, often appearing as a
cone or fan shape in the residual plot. This can make your coefficient estimates
less reliable.

Q2: Using the dataset above, perform a VIF analysis and fix any multicollinearity issue that
exists.

#2. VIF analysis: detect and fix multicollinearity if it exists
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Ensure all values in X are float
x = x.astype(float)

# Step 1: Calculate VIF for each feature


vif_data = []
for i in range(x.shape[1]):
    vif = variance_inflation_factor(x.values, i)
    vif_data.append((x.columns[i], vif))

vif_df = pd.DataFrame(vif_data, columns=["Feature", "VIF"])
print("VIF values:\n", vif_df)

# Step 2: Remove features with very high VIF (say > 10)
high_vif_features = vif_df[vif_df["VIF"] > 10]["Feature"].tolist()
print("\nFeatures to remove (VIF > 10):", high_vif_features)

x_reduced = x.drop(columns=high_vif_features)

# Step 3: Train Linear Regression again


x_train, x_test, y_train, y_test = train_test_split(x_reduced, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Step 4: Print results


print("\nMSE after VIF correction:", mean_squared_error(y_test, y_pred))
print("R² after VIF correction:", r2_score(y_test, y_pred))

●​ Q: Your VIF analysis shows that 'feature_A' has a VIF of 25. What does this mean
and what should you do?
○​ A: A VIF of 25 is very high and indicates severe multicollinearity. It means that
'feature_A' is highly predictable from the other features in the model. The best
course of action is to remove 'feature_A' from the model and then re-calculate the
VIF scores for the remaining features to see if the problem is resolved.
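The code in Q2 drops every high-VIF feature at once; the answer above suggests removing one feature at a time and re-checking. A minimal sketch of that iterative procedure (assuming x is the feature DataFrame from Q2):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(x, threshold=10.0):
    x = x.astype(float).copy()
    while x.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(x.values, i) for i in range(x.shape[1])],
            index=x.columns,
        )
        if vifs.max() <= threshold:
            break
        x = x.drop(columns=[vifs.idxmax()])  # drop the single worst offender and re-check
    return x

x_reduced = drop_high_vif(x)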
Q3 & Q4: Ridge Regression and Hyperparameter Tuning

Ridge Regression is a regularization technique used to prevent overfitting in linear models.

Key Concepts & Definitions

●​ Regularization: The process of adding a penalty term to the model's cost function to
discourage complexity (i.e., large coefficient values). This helps prevent overfitting and
improves the model's generalization to new data.
●​ Ridge Regression (L2 Regularization): Adds a penalty equal to the square of the
magnitude of the coefficients. The cost function becomes: MSE + α ∑ᵢ βᵢ². This shrinks
the coefficients towards zero but never makes them exactly zero.
●​ Hyperparameter: A parameter whose value is set before the learning process begins.
For Ridge, the key hyperparameter is alpha (α), which controls the strength of the
penalty.
●​ GridSearchCV: A technique to perform hyperparameter tuning by exhaustively
searching through a specified grid of parameter values and finding the combination that
performs best based on cross-validation.

#3. Ridge Regression – Scratch Implementation

#3a) Load the dataset and use all numerical features (area, bedrooms, bathrooms, stories,
parking) to predict price.

features = ["area", "bedrooms", "bathrooms", "stories", "parking"]

x = df[features].values

y = df["price"].values.reshape(-1, 1)

# Add bias column

X_b = np.c_[np.ones((x.shape[0], 1)), x]

#3 Normal Equation (OLS) with L2 penalty


# Ridge Regression Normal Equation

alpha = 10

I = np.eye(X_b.shape[1])

I[0, 0] = 0 # don't penalize bias

theta_ridge = np.linalg.inv(X_b.T.dot(X_b) + alpha * I).dot(X_b.T).dot(y)

print("Theta (Normal Eq Ridge):", theta_ridge.ravel())

#3 Gradient Descent with L2 penalty

def ridge_gradient_descent(X, y, alpha=10, lr=1e-8, iterations=1000):
    m, n = X.shape
    theta = np.random.randn(n, 1)
    for i in range(iterations):
        # MSE gradient plus the L2 penalty gradient
        gradients = (2 / m) * X.T.dot(X.dot(theta) - y) + 2 * alpha * theta
        gradients[0] -= 2 * alpha * theta[0]  # don't penalize the bias term
        theta -= lr * gradients
    return theta

theta_gd = ridge_gradient_descent(X_b, y)

print("Theta (Gradient Descent Ridge):", theta_gd.ravel())

#3 Evaluate the model using MSE and R².

from sklearn.metrics import mean_squared_error, r2_score


y_pred_normal = X_b.dot(theta_ridge)

y_pred_gd = X_b.dot(theta_gd)

print("Normal Eq Ridge -> MSE:", mean_squared_error(y, y_pred_normal), "R²:", r2_score(y,


y_pred_normal))

print("GD Ridge -> MSE:", mean_squared_error(y, y_pred_gd), "R²:", r2_score(y, y_pred_gd))

Q4. Hyperparameter Tuning with GridSearch

a) Use GridSearchCV to tune alpha for Ridge over the range {0.01, 0.1, 1, 10, 100}.

b) Report the best alpha chosen by GridSearch.

c) Evaluate the final models on test data (MSE and R²).

from sklearn.linear_model import Ridge

from sklearn.model_selection import GridSearchCV

# GridSearchCV

ridge = Ridge()

param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(ridge, param_grid, cv=5, scoring="r2")

grid.fit(x_train, y_train)

print("Best alpha:", grid.best_params_["alpha"])

# Evaluate final model


best_ridge = grid.best_estimator_

y_pred = best_ridge.predict(x_test)

print("Final MSE:", mean_squared_error(y_test, y_pred))

print("Final R²:", r2_score(y_test, y_pred))

Viva Voce Prep

●​ Q: What is the purpose of Ridge Regression?


○​ A: Its main purpose is to address multicollinearity and prevent overfitting. By
adding a penalty for large coefficients, it forces the model to be simpler and rely
less on any single feature, making it more robust and better at generalizing to
unseen data.
●​ Q: What does the hyperparameter alpha control in Ridge Regression?
○​ A: alpha controls the amount of regularization. If alpha = 0, Ridge Regression
is identical to standard Linear Regression (OLS). As alpha increases, the
penalty becomes stronger, and the coefficients are shrunk more aggressively
towards zero. A very large alpha will lead to underfitting, where all coefficients
are close to zero.
●​ Q: Why do we use cross-validation (like in GridSearchCV)?
○​ A: If we tune hyperparameters on our test set, we risk "leaking" information from
the test set into our model selection process. The model might end up being
tuned specifically for that test set and won't perform well on new, unseen data.
Cross-validation allows us to find the best hyperparameters using only the
training data, keeping the test set completely "unseen" for a final, unbiased
evaluation.

ML Lab 4 Prep Guide: LASSO AND ELASTICNET

Q1. LASSO Regression


a) Implement LASSO Regression using the following objective function:

J(θ) = (1/n) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + α ∑ⱼ |θⱼ|

where:
- hθ(x) is the prediction,
- α is the regularization parameter.

b) Try at least two or more different values of α in your implementation. Evaluate the models
using Mean Squared Error (MSE) and R² score, and record how the choice of α affects the
results.
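The snippet below assumes the housing data has already been split and standardized. A minimal setup sketch (file name and feature list carried over from the earlier labs; not part of the original lab code):

import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv('Housing.csv')
X = df[['area', 'bedrooms', 'bathrooms', 'stories', 'parking']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test set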

alphas = [1000, 10000]


results_lasso = {}

print("--- LASSO Regression Results ---")


for alpha_val in alphas:
    lasso = Lasso(alpha=alpha_val, random_state=42)
    lasso.fit(X_train_scaled, y_train)

    y_pred = lasso.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results_lasso[alpha_val] = {'MSE': mse, 'R2': r2, 'Coefficients': lasso.coef_}

    print(f"\nAlpha = {alpha_val}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"R² Score: {r2:.4f}")
    print("Coefficients:", lasso.coef_)

Here is a clear explanation of the code and a set of potential viva voce (oral examination)
questions to help you prepare.

Code Explanation

This Python script uses the scikit-learn library to perform LASSO (Least Absolute
Shrinkage and Selection Operator) Regression. The main goal is to train and evaluate two
separate LASSO models using two different, very high values for the regularization parameter,
alpha.

Step-by-Step Breakdown
1.​ Initialization
○​ alphas = [1000, 10000]: This list holds the two values for the alpha
hyperparameter that will be tested. alpha controls the strength of the
regularization. A higher alpha value imposes a stronger penalty on the model's
coefficients.
○​ results_lasso = {}: An empty dictionary is created to store the evaluation
metrics (MSE, R²) and the learned coefficients for each alpha value.
2.​ Iterating Through Alpha Values
○​ for alpha_val in alphas:: This loop runs the entire training and
evaluation process once for each value in the alphas list (first for 1000, then for
10000).
3.​ Model Creation and Training
○​ lasso = Lasso(alpha=alpha_val, random_state=42): An instance of
the LASSO regression model is created.
■​ alpha=alpha_val: The regularization strength is set to the current
value from the loop.
■​ random_state=42: This ensures that the results are reproducible. If the
algorithm had a random component, it would produce the same result
every time it's run.
○​ lasso.fit(X_train_scaled, y_train): This is the training step. The
model learns the relationship between the features (X_train_scaled) and the
target variable (y_train). The term _scaled suggests the training features
have been standardized or normalized, which is crucial for LASSO to work
effectively.
4.​ Prediction and Evaluation
○​ y_pred = lasso.predict(X_test_scaled): The trained model is used to
make predictions on new, unseen data (X_test_scaled).
○​ mse = mean_squared_error(y_test, y_pred): The Mean Squared
Error is calculated. This metric measures the average of the squares of the
errors—that is, the average squared difference between the actual values
(y_test) and the predicted values (y_pred). Lower is better.
○​ r2 = r2_score(y_test, y_pred): The R-squared (R²) score is calculated.
This represents the proportion of the variance in the dependent variable that is
predictable from the independent variables. It ranges from -∞ to 1. Closer to 1 is
better.
5.​ Storing and Displaying Results
○​ results_lasso[alpha_val] = ...: The calculated MSE, R² score, and the
model's learned coefficients (lasso.coef_) are stored in the dictionary, using
the alpha value as the key.
○​ print(...): The results for the current alpha are printed to the console for
immediate review. The coefficients are particularly important for LASSO, as they
show which features the model has selected or discarded.

💡 Key Takeaway
The purpose of this code is to demonstrate the effect of a very strong regularization penalty
in LASSO regression. By using large alpha values like 1000 and 10000, the experiment is set
up to force many of the model's coefficients (lasso.coef_) to become exactly zero. This
process is known as feature selection.

Viva Voce Preparation 👨‍🏫


Here are potential questions an examiner might ask about this code and the underlying
concepts.

Fundamental Concepts

Q1: What is LASSO Regression?

●​ Answer: LASSO is a linear regression model that includes a regularization term. Its
main purpose is to prevent overfitting and perform automatic feature selection. It does
this by adding a penalty proportional to the absolute value of the magnitude of the
coefficients. This is also known as L1 Regularization.

Q2: What is the role of the alpha parameter?

●​ Answer: alpha is the regularization strength parameter. It's a hyperparameter that
controls the trade-off between fitting the data well and keeping the model's coefficients
small.
○​ A high alpha (like 1000 or 10000 in your code) creates a large penalty, forcing
many coefficients to become exactly zero. This results in a simpler model
(feature selection).
○​ A low alpha (close to 0) reduces the penalty, making the LASSO model behave
almost like a standard Linear Regression model.

Q3: What is the difference between LASSO (L1) and Ridge (L2) Regression?

●​ Answer: The primary difference lies in the penalty term.


○​ LASSO (L1) uses the sum of the absolute values of the coefficients (λ∑|βⱼ|).
This can shrink coefficients to exactly zero, making it useful for feature selection.
○​ Ridge (L2) uses the sum of the squared values of the coefficients (λ∑βⱼ²). This
shrinks coefficients close to zero but rarely makes them exactly zero. It's better
for handling multicollinearity.

Code-Specific Questions

Q4: In your code, you use X_train_scaled. Why is feature scaling important for
LASSO?

●​ Answer: LASSO's penalty is based on the size of the coefficients. If features are on
different scales (e.g., one feature from 0-1 and another from 100-10,000), the feature
with the larger scale will have a smaller coefficient to begin with, and the penalty will
unfairly affect features with smaller scales. Scaling (like Standardization or
Normalization) ensures all features are on a comparable scale, so the LASSO penalty is
applied fairly to all of them.

Q5: Looking at your alpha values (1000, 10000), what do you expect to see in the
lasso.coef_ output?

●​ Answer: I expect to see that for alpha=1000, many of the coefficients will be zero. For
alpha=10000, which is an even stronger penalty, I expect even more (or potentially
all) coefficients to be zero. The model is being heavily "punished" for having any
non-zero coefficients.

Q6: How would you expect the R² score to change when you increase alpha from 1000 to
10000?

●​ Answer: I expect the R² score to decrease (get worse). Because a higher alpha forces
more coefficients to zero, the model becomes simpler. These extremely high alpha
values are likely to make the model too simple (i.e., cause underfitting), to the point
where it can't capture the underlying patterns in the data, leading to a poorer R² score.

Interpretation and Follow-up Questions

Q7: What does it mean if a coefficient becomes zero?

●​ Answer: It means the LASSO model has determined that the corresponding feature is
not important for predicting the target variable and has effectively removed it from the
model. This is the feature selection property of LASSO.

Q8: If your R² score was negative, what would that signify?

●​ Answer: A negative R² score means the model is performing worse than a simple
horizontal line representing the mean of the target variable (y_test). It's a strong
indicator that the model is a very poor fit for the data, which can easily happen with an
overly aggressive alpha value that creates a model that is too simple.

Q9: How would you choose the best value for alpha in a real-world project?

●​ Answer: I wouldn't just test two arbitrary values. I would use a technique like
Cross-Validation (e.g., LassoCV in scikit-learn or GridSearchCV) to systematically
test a range of alpha values and find the one that provides the best performance on
unseen data, balancing model complexity and predictive power.
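A minimal sketch of the LassoCV approach mentioned above (alpha grid chosen arbitrarily; the scaled split from Q1 is assumed):

from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=[0.01, 0.1, 1, 10, 100, 1000], cv=5, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)
print("Best alpha:", lasso_cv.alpha_)
print("Test R²:", lasso_cv.score(X_test_scaled, y_test))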

Q2. ElasticNet Regression


a) Implement ElasticNet Regression using the following objective function:

J(θ) = (1/n) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + α (λ₁ ∑ⱼ |θⱼ| + λ₂ ∑ⱼ θⱼ²)

where:
- α controls the strength of regularization,
- λ₁ and λ₂ balance between LASSO and Ridge penalties.

b) Experiment with different values of α and l1_ratio. Compare the model’s performance using
MSE and R² score, and report how the parameter choices affect the results compared to
LASSO.
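The code below reuses the scaled train/test split from Q1 and additionally assumes this import:

from sklearn.linear_model import ElasticNet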

params = [{'alpha': 1000, 'l1_ratio': 0.5}, {'alpha': 1000, 'l1_ratio': 0.9}]


results_elastic = {}

print("\n--- ElasticNet Regression Results ---")


for p in params:
    elastic_net = ElasticNet(alpha=p['alpha'], l1_ratio=p['l1_ratio'], random_state=42)
    elastic_net.fit(X_train_scaled, y_train)

    y_pred = elastic_net.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    key = f"alpha={p['alpha']}, l1_ratio={p['l1_ratio']}"
    results_elastic[key] = {'MSE': mse, 'R2': r2, 'Coefficients': elastic_net.coef_}

    print(f"\n{key}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"R² Score: {r2:.4f}")
    print("Coefficients:", elastic_net.coef_)

This Python script trains and evaluates an ElasticNet regression model using two different
sets of hyperparameters. It then stores and prints the performance metrics (MSE and R²) and
the model's coefficients for each hyperparameter combination.

Code Explanation

1.​ Initialization:
○​ params = [...]: This is a list of dictionaries. Each dictionary specifies a
combination of hyperparameters to be tested. In this case, you're testing two
sets: (alpha=1000, l1_ratio=0.5) and (alpha=1000, l1_ratio=0.9).
○​ results_elastic = {}: An empty dictionary is created to store the
evaluation results for each model trained.
2.​ Iteration and Training Loop:
○​ for p in params:: The code iterates through each dictionary (p) in the
params list.
○​ elastic_net = ElasticNet(...): Inside the loop, a new ElasticNet
model is created for each set of parameters.
■​ alpha=p['alpha']: Sets the overall regularization strength. A higher
alpha value imposes a stronger penalty on the model's coefficients.
■​ l1_ratio=p['l1_ratio']: Sets the mix between L1 (Lasso) and L2
(Ridge) penalties. An l1_ratio of 0.5 means the penalty is 50% L1
and 50% L2.
■​ random_state=42: Ensures that the results are reproducible every time
the code is run.
○​ elastic_net.fit(X_train_scaled, y_train): The model is trained
using the scaled training features (X_train_scaled) and the training target
values (y_train).
3.​ Prediction and Evaluation:
○​ y_pred = elastic_net.predict(X_test_scaled): The trained model is
used to make predictions on the unseen (and scaled) test data.
○​ mse = mean_squared_error(...) and r2 = r2_score(...): The code
calculates two key performance metrics: Mean Squared Error (MSE) and the
R-squared (R²) score by comparing the actual test values (y_test) with the
model's predictions (y_pred).
4.​ Storing and Displaying Results:
○​ key = f"...": A descriptive string key (e.g., "alpha=1000, l1_ratio=0.5") is
created for the current hyperparameter set.
○​ results_elastic[key] = {...}: The calculated MSE, R2 score, and the
learned model coefficients (elastic_net.coef_) are stored in the
results_elastic dictionary under this key.
○​ print(...): The results for the current iteration are printed to the console for
immediate review.

Viva Voce Preparation 🎤


Here are some common questions an examiner might ask about this code and the concepts
behind it.

Q1: What is the primary purpose of this code?

A: The primary purpose is to perform a simple hyperparameter comparison for an ElasticNet
regression model. It trains the model with two different sets of alpha and l1_ratio values,
evaluates their performance on a test set using MSE and R² metrics, and prints the results to
see which set of hyperparameters performs better.

Q2: What is ElasticNet Regression and why would you use it?

A: ElasticNet is a regularized linear regression model that linearly combines the L1 (Lasso)
and L2 (Ridge) penalties.

You use it to:

1.​ Prevent Overfitting: Regularization penalizes large coefficients, leading to simpler
models that generalize better to new data.
2.​ Perform Feature Selection: The L1 part of the penalty can shrink some feature
coefficients to exactly zero, effectively removing them from the model.
3.​ Handle Multicollinearity: It's particularly useful when you have several features that are
highly correlated. Lasso might randomly pick one and discard the others, while
ElasticNet tends to group and shrink their coefficients together.
Q3: Explain the alpha and l1_ratio hyperparameters.

A:

●​ alpha (α): This is the regularization strength parameter. It's a constant that multiplies
the L1 and L2 penalty terms.
○​ A high alpha (like 1000 in your code) imposes a strong penalty, which will
shrink the coefficients more aggressively towards zero.
○​ An alpha of 0 removes regularization entirely, making it equivalent to a standard
Linear Regression.
●​ l1_ratio (ρ): This parameter controls the mix between L1 and L2 regularization. Its
value is between 0 and 1.
○​ l1_ratio = 1: The model is a Lasso Regression.
○​ l1_ratio = 0: The model is a Ridge Regression.
○​ 0 < l1_ratio < 1: The model is a combination of both. For example,
l1_ratio=0.5 means the penalty is an equal mix of L1 and L2 norms.

The ElasticNet penalty term is: α · ρ · ||w||₁ + α · ((1 − ρ)/2) · ||w||₂²

Q4: The input data is named X_train_scaled. Why is scaling important
for ElasticNet?

A: Scaling (e.g., using StandardScaler or MinMaxScaler) is crucial for regularized models
like ElasticNet. The model's penalty term is applied to the magnitude of the coefficients. If
features are on different scales (e.g., one feature ranges from 0-1 and another from
1000-50000), the model will unfairly penalize the feature with the larger scale. Scaling ensures
that all features contribute equally to the penalty term, leading to a more fair and accurate
model.

Q5: How do you interpret the MSE and R² scores that are printed?

A:

●​ Mean Squared Error (MSE): This is the average of the squared differences between the
predicted and actual values. A lower MSE is better, with 0 being a perfect score. Its
units are the square of the target variable's units, which can make it hard to interpret
directly.
●​ R² Score (Coefficient of Determination): This indicates the proportion of the variance
in the target variable that is explained by the model.
○​ An R² of 1 means the model perfectly predicts the data.
○​ An R² of 0 means the model performs no better than simply predicting the mean
of the target variable.
○​ A negative R² means the model is worse than the baseline mean model.
○​ A higher R² is better.

Q6: What do the printed Coefficients represent? What is the significance
if a coefficient is zero?

A: The coefficients (elastic_net.coef_) represent the learned weight or importance for
each feature in the training data (X_train_scaled). A coefficient tells you how much the
target variable is expected to change for a one-unit increase in that feature, holding all other
features constant.

If a coefficient is zero, it means the model has determined that this feature is not useful for
making predictions. The L1 penalty component of ElasticNet is responsible for this, effectively
performing automatic feature selection.

Q3. Polynomial Regression


a) Load the dataset and select “area” as the input feature and “price” as the target.

b) Transform the input feature using polynomial terms as:

x → [x, x², …, x^d]

for polynomial degrees d = 2, 3, 4.

c) Implement Polynomial Regression using PolynomialFeatures and LinearRegression.
Evaluate the models using MSE and R² score.
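The code below assumes the train/test split from the earlier questions plus these imports:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score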

X_poly_train = X_train[['area']]
X_poly_test = X_test[['area']]

degrees = [2, 3, 4]

results_poly = {}

print("\n--- Polynomial Regression Results ---")

for d in degrees:

    poly_reg = Pipeline([
        ('poly', PolynomialFeatures(degree=d, include_bias=False)),
        ('lin_reg', LinearRegression())
    ])

    poly_reg.fit(X_poly_train, y_train)
    y_pred = poly_reg.predict(X_poly_test)

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results_poly[d] = {'MSE': mse, 'R2': r2}

    print(f"\nDegree = {d}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"R² Score: {r2:.4f}")


This script performs Polynomial Regression to model the relationship between a single feature
('area') and a target variable. It systematically tests different polynomial degrees (2, 3, and 4),
evaluates each model's performance, and prints the results to find the best degree of polynomial
fit.

Code Explanation
1.​ Feature Selection:
○​ X_poly_train = X_train[['area']]: Instead of using the entire dataset, this line
creates a new DataFrame X_poly_train that contains only the 'area' column
from the original training set. The same is done for the test set. This is done to
build a model based solely on this single feature.
2.​ Setup for Iteration:
○​ degrees = [2, 3, 4]: A list containing the different polynomial degrees to be tested.
The loop will run three times, creating a quadratic (degree 2), cubic (degree 3),
and quartic (degree 4) model.
○​ results_poly = {}: An empty dictionary is initialized to store the performance
metrics (MSE and R²) for each degree tested.
3.​ Model Building and Training Loop:
○​ for d in degrees:: The code iterates through each degree in the degrees list.
○​ poly_reg = Pipeline([...]): This creates a scikit-learn Pipeline. A pipeline is an
excellent tool that chains multiple data processing steps together. Here, it
combines two steps:
1.​ ('poly', PolynomialFeatures(...)): The first step generates new polynomial
features. For an input feature x (area), PolynomialFeatures(degree=d) will
create the new features x², x³, ..., x^d.
■​ include_bias=False: This is set to False because the subsequent
LinearRegression step automatically handles the intercept (bias)
term.
2.​ ('lin_reg', LinearRegression()): The second step is a standard
LinearRegression model. This model takes the newly created polynomial
features (e.g., area and area²) as its input and finds the best linear fit for
them.
○​ poly_reg.fit(X_poly_train, y_train): The entire pipeline is trained. The 'area' data is
first transformed into polynomial features, which are then used to train the linear
regression model.
4.​ Prediction and Evaluation:
○​ y_pred = poly_reg.predict(X_poly_test): The trained pipeline is used to make
predictions. The test data (X_poly_test) automatically undergoes the same
PolynomialFeatures transformation before the prediction is made.
○​ mse = mean_squared_error(...) and r2 = r2_score(...): The Mean Squared Error
(MSE) and R-squared (R²) score are calculated to evaluate how well the model
performed on the unseen test data.
5.​ Storing and Displaying Results:
○​ results_poly[d] = {...}: The calculated metrics are stored in the results_poly
dictionary, using the degree d as the key.
○​ print(...): The degree and its corresponding performance metrics are printed to
the console.

Viva Voce Preparation 🎤


Here are potential questions and answers to help you prepare for an oral examination on this
topic.

Q1: What is Polynomial Regression and why is it used?


A: Polynomial Regression is a type of regression analysis that models the relationship between
an independent variable x and a dependent variable y as an nth degree polynomial. While it fits
a nonlinear relationship, it's still considered a linear model because the regression function is
linear in terms of the unknown parameters (the coefficients).

It's used when a simple straight line (standard linear regression) is not sufficient to capture the
underlying trend in the data. It allows for a more flexible, curved fit.

Q2: Explain the purpose of the Pipeline in this code.


A: The Pipeline is used to streamline the workflow. It chains the feature transformation
(PolynomialFeatures) and the final model (LinearRegression) into a single object. This is
beneficial because:
1.​ Simplicity: We can call .fit() and .predict() just once on the pipeline, and it handles all
the intermediate steps automatically.
2.​ Data Leakage Prevention: It ensures that the transformations are learned from the
training data only and are then applied consistently to the test data during prediction.
3.​ Organization: It keeps the modeling process clean and easy to follow.

Q3: What does PolynomialFeatures(degree=3) do to a feature named 'area'?


A: If you pass the 'area' feature to PolynomialFeatures(degree=3), it will transform the single
feature [area] into a set of three features: [area, area², area³]. The subsequent LinearRegression
model then fits the equation:

y = β₀ + β₁·area + β₂·area² + β₃·area³

Q4: What is the risk of choosing a very high degree (e.g., 20) for the
polynomial?
A: The biggest risk is overfitting. A high-degree polynomial is extremely flexible and can wiggle
its way through every single data point in the training set, resulting in a very high R² score on
that data. However, it will likely fail to generalize to new, unseen data because it has learned the
noise in the training set, not the underlying pattern. This will result in poor performance (high
MSE, low R²) on the test set.

Q5: Looking at the printed results, how would you choose the "best"
degree?
A: You would typically look for the degree that gives the best performance on the test set,
which usually means the highest R² score and the lowest MSE.

However, if two degrees (e.g., 3 and 4) give very similar performance on the test set, it's often
better to choose the simpler model (degree 3) based on the principle of Occam's Razor. A
simpler model is less likely to be overfitted and is more generalizable.

Q6: Could you use Polynomial Regression with more than one feature, for
example, 'area' and 'bedrooms'? What would PolynomialFeatures do then?
A: Yes, absolutely. If you provided two features, 'area' (x₁) and 'bedrooms' (x₂), to
PolynomialFeatures(degree=2), it would generate not only the squared terms but also the
interaction terms. The output features would be: [x₁, x₂, x₁², x₁x₂, x₂²]. This allows the model
to capture how the features interact with each other.
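A quick illustration of the interaction terms described above (feature values chosen arbitrarily):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])  # one sample: x1 = area-like value, x2 = bedrooms-like value
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X_demo))                # [[2. 3. 4. 6. 9.]] -> x1, x2, x1^2, x1*x2, x2^2
print(poly.get_feature_names_out(['x1', 'x2']))  # ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']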

Q4. Hyperparameter Tuning using GridSearch and Randomized Search

a) Define a parameter grid to search for the optimal polynomial degree K.

b) Use GridSearchCV to find the best value of K by performing cross-validation. Report the best
polynomial degree and the corresponding MSE and R² score.
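The search code below assumes the pipeline imports from Q3 plus:

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV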

# Use the same single-feature data from Q3
# (fit the search on the training split only so the test set stays unseen for the final evaluation)

X_poly = X_train[['area']]
y_poly = y_train

pipeline = Pipeline([

('poly', PolynomialFeatures(include_bias=False)),

('regressor', LinearRegression())

])

param_grid = {'poly__degree': [1, 2, 3, 4, 5, 6]}

grid_search = GridSearchCV(pipeline, param_grid, cv=5,
                           scoring='neg_mean_squared_error', n_jobs=-1)

grid_search.fit(X_poly, y_poly)

print("\n--- GridSearchCV Results ---")

print(f"Best Polynomial Degree (K): {grid_search.best_params_['poly__degree']}")

# Evaluate the best model on the test set

best_model = grid_search.best_estimator_

y_pred_best = best_model.predict(X_poly_test)

mse_best = mean_squared_error(y_test, y_pred_best)

r2_best = r2_score(y_test, y_pred_best)

print(f"Test MSE for best model: {mse_best:.2f}")

print(f"Test R² Score for best model: {r2_best:.4f}")


c) Similarly, use RandomizedSearchCV to search for the optimal polynomial degree K.

param_dist = {'poly__degree': range(1, 11)} # Search degrees from 1 to 10

random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=5, cv=5,
                                   scoring='neg_mean_squared_error', n_jobs=-1, random_state=42)

random_search.fit(X_poly, y_poly)

print("\n--- RandomizedSearchCV Results ---")

print(f"Best Polynomial Degree (K): {random_search.best_params_['poly__degree']}")

# Evaluate the best model found on the test set

best_model_random = random_search.best_estimator_

y_pred_best_random = best_model_random.predict(X_poly_test)

mse_best_random = mean_squared_error(y_test, y_pred_best_random)

r2_best_random = r2_score(y_test, y_pred_best_random)

print(f"Test MSE for best model: {mse_best_random:.2f}")

print(f"Test R² Score for best model: {r2_best_random:.4f}")

This code automates the process of finding the best hyperparameter (in this case, the
polynomial degree) for a regression model using two common techniques: Grid Search and
Randomized Search. It then evaluates the best model found by each technique on a separate
test set.
Code Explanation
1. Common Setup
●​ Data Selection: The code selects the single feature area as the input (X_poly) and price
as the target (y_poly) from the main dataframe df.
●​ Pipeline Definition: A Pipeline is created to first generate polynomial features from the
'area' column and then feed them into a LinearRegression model. This pipeline is the
base model that will be tuned.

2. GridSearchCV
●​ Goal: To exhaustively search through a manually specified list of hyperparameter values
to find the best one.
●​ param_grid: This dictionary defines the search space. 'poly__degree': [1, 2, 3, 4, 5, 6]
tells Grid Search to test every single integer degree from 1 to 6 for the poly step in the
pipeline. The double underscore __ is the syntax used to access parameters of a step
within a pipeline.
●​ GridSearchCV(...): This object orchestrates the search.
○​ cv=5: It uses 5-fold cross-validation. The data is split into 5 parts. For each
degree, the model is trained 5 times, each time using 4 parts for training and 1
part for validation. This prevents overfitting to a single train-test split and gives a
more robust performance estimate.
○​ scoring='neg_mean_squared_error': This is the metric used to judge which
degree is best. Scikit-learn's convention is that higher scores are better. Since we
want to minimize Mean Squared Error (MSE), we use its negative
(neg_mean_squared_error), which we want to maximize.
○​ n_jobs=-1: Uses all available CPU cores to speed up the computation.
●​ .fit(X_poly, y_poly): This command starts the search. It will train a total of 6 (degrees) ×
5 (folds) = 30 models.
●​ Evaluation: After finding the best degree, the code retrieves the best_estimator_ (the
pipeline retrained on all data with the best degree) and evaluates its final performance
on the unseen test set (X_poly_test).

3. RandomizedSearchCV
●​ Goal: To efficiently search for the best hyperparameters by randomly sampling from a
given distribution of values.
●​ param_dist: Defines the distribution to sample from. Here, range(1, 11) means degrees
can be any integer from 1 to 10.
●​ RandomizedSearchCV(...): This object runs the randomized search.
○​ n_iter=5: This is the key parameter. Instead of trying all 10 possible degrees, it
will randomly select and test only 5 of them.
●​ .fit(X_poly, y_poly): This starts the search. It will train a total of 5 (iterations) × 5 (folds)
= 25 models.
●​ Evaluation: The process of retrieving the best model and evaluating it on the test set is
identical to Grid Search.

Viva Voce Preparation 🎤


Q1: What is the main difference between GridSearchCV and
RandomizedSearchCV?
A: The main difference is how they explore the hyperparameter space.
●​ Grid Search is exhaustive. It systematically tries every single combination of
parameters you provide in the param_grid. This guarantees it will find the best
combination within that grid, but it can be very slow if the search space is large.
●​ Randomized Search is stochastic. It randomly samples a fixed number of
combinations (n_iter) from the parameter distributions. It doesn't try everything, but it's
much faster and often finds a "good enough" or even the best solution, especially when
only a few hyperparameters have a real impact on the model's performance.

Q2: In this code, why is cv=5 used? What problem does cross-validation
solve?
A: cv=5 specifies 5-fold cross-validation. It's a technique to get a reliable estimate of a
model's performance. It solves the problem of evaluation variance. If you use a single
train-validation split, your performance metric might be high or low simply due to the luck of how
the data was split. By training and validating the model 5 times on 5 different subsets of the data
and averaging the results, cross-validation provides a much more stable and trustworthy
measure of the model's true performance.

Q3: Explain the syntax poly__degree in the param_grid dictionary.


A: This syntax is specific to scikit-learn's Pipeline and search tools. It's used to specify a
parameter for a specific step within the pipeline.
●​ poly: This is the name we gave to the PolynomialFeatures step in our pipeline.
●​ __ (double underscore): This is the delimiter that separates the step name from the
parameter name.
●​ degree: This is the actual parameter of the PolynomialFeatures class that we want to
tune.​
So, poly__degree tells the search algorithm: "Find the step named 'poly' and set its
'degree' parameter to these values."
Q4: Why does the code evaluate the best_estimator_ on a test set at the
end? Isn't the score from the cross-validation enough?
A: This is a crucial concept. The cross-validation score is used to select the best
hyperparameter (the best degree). However, this score is still derived from the same data that
was used for the tuning process. Therefore, it can be slightly optimistic.

The final evaluation on a completely held-out test set (data the model has never seen during
training or tuning) provides an unbiased estimate of how the model will perform in the real
world on new data. This is the true measure of the model's generalization ability.

Q5: When would you strongly prefer RandomizedSearchCV over
GridSearchCV?
A: You'd strongly prefer RandomizedSearchCV when you have a large hyperparameter
space. Imagine you wanted to tune 3 different hyperparameters, each with 10 possible values.
●​ GridSearchCV would have to train 10 × 10 × 10 = 1000 models.
●​ RandomizedSearchCV could get a very good result by training far fewer models, for
example, n_iter=50. It's much more computationally efficient and is the preferred method
when dealing with many hyperparameters.
