MLT Lab Prep Guide
This lab focuses on the fundamentals of data handling with Pandas, performing Exploratory
Data Analysis (EDA), and preparing categorical data for machine learning models.
This question tests your ability to read and write data in different formats using the Pandas
library.
import pandas as pd
# Append new rows to an existing CSV (new_passengers is a DataFrame of rows to add)
new_passengers.to_csv('titanic.csv', mode='a', header=False, index=False)
● Q: What are the main differences between CSV, Excel, and JSON?
○ A: CSV is plain text, lightweight, and universally supported but lacks data type
support. Excel is a binary format that supports data types, formulas, and multiple
sheets but is proprietary. JSON is text-based, human-readable, and preserves
data structures (like nested objects), making it ideal for web APIs.
● Q: Why is appending to a JSON file more complex than a CSV file?
○ A: CSV is a line-by-line format, so you can simply add new lines at the end. A
standard JSON file is a single structured object (like a list of dictionaries). To
append, you must parse the entire object into memory, add the new data, and
then write the entire modified object back to the file.
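A minimal sketch of this read-modify-write pattern, assuming a hypothetical 'titanic.json' file that holds a list of passenger records:
import json
# hypothetical file and record, for illustration only
with open('titanic.json', 'r') as f:
    records = json.load(f)  # parse the entire list of records into memory
records.append({'Name': 'New Passenger', 'Age': 30})  # add the new data
with open('titanic.json', 'w') as f:
    json.dump(records, f)  # write the whole modified list back out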
import pandas as pd
df = pd.read_csv('titanic.csv')
● Q: In the 'Age' column, if the mean is 30 and the median is 28, what does that
suggest?
○ A: It suggests the distribution of age is slightly right-skewed. The mean is being
pulled higher by a few older passengers (outliers).
● Q: What does a high standard deviation in the 'Fare' column indicate?
○ A: It indicates that the fares are widely spread out. There is a large variability in
how much passengers paid, likely due to the different classes (1st, 2nd, 3rd).
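A short sketch of how these summary statistics can be checked directly with Pandas (using the df loaded above):
print(df['Age'].mean(), df['Age'].median())  # mean noticeably above median suggests right skew
print(df['Fare'].std())                      # a large value means fares are widely spread out
print(df.describe())                         # full numeric summary (count, mean, std, quartiles)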
Q3: Exploratory Data Visualizations 📊
Visualizations help uncover patterns, anomalies, and relationships in the data that are not
obvious from summary statistics alone.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('titanic.csv')
# a) Histogram of Age
sns.histplot(df['Age'].dropna(), kde=True)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
# b) Pie chart for Sex
sex_counts = df['Sex'].value_counts()
plt.pie(sex_counts, labels=sex_counts.index, autopct='%1.1f%%')
plt.title('Sex Distribution')
plt.show()
# c) Boxplot of Fare
sns.boxplot(x=df['Fare'])
plt.xlabel('Fare')
plt.show()
# d) Correlation heatmap
numeric_cols = df.select_dtypes(include='number')
correlation_matrix = numeric_cols.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
Machine learning models require numerical input, so categorical text data must be converted
into a numerical format. This process is called feature encoding.
● Label Encoding: Assigns a unique integer to each category (e.g., 'S' -> 0, 'C' -> 1, 'Q' ->
2). This implies an ordinal relationship (0 < 1 < 2), which is often undesirable.
● One-Hot Encoding (OHE): Creates a new binary (0 or 1) column for each category. For
a feature with k categories, it creates k new columns. This avoids the ordinality issue but
can create many columns.
● Dummy Encoding: A variation of OHE where it creates k-1 new columns. The k-th
category is implicitly represented when all other dummy columns are 0. This is done to
avoid multicollinearity, which can be a problem for linear models.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# a) Label Encoding
le = LabelEncoder()
df['Embarked_LabelEncoded'] = le.fit_transform(df['Embarked'])
# b) One-Hot Encoding (k new columns)
df_ohe = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')
# c) Dummy Encoding (k-1 columns; drop the first category to avoid multicollinearity)
df_dummy = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked', drop_first=True)
Q1. Implementing Simple Linear Regression from Scratch (OLS & GD)
a) Load the dataset and choose “area” as the feature and “price” as the target.
b) Write a function to implement Simple Linear Regression using Ordinary Least Squares (OLS)
(derive slope & intercept).
c) Write another function for Simple Linear Regression using Gradient Descent (GD) with a learning rate and fixed iterations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Housing.csv')
x = df["area"].values
y = df["price"].values
x_mean, y_mean = x.mean(), y.mean()
b1_ols = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b0_ols = y_mean - b1_ols * x_mean  # intercept
y_pred_ols = b0_ols + b1_ols * x  # fitted values
print("Q1(b) OLS:")
print(f"OLS → slope: {float(b1_ols):.4f}, intercept: {float(b0_ols):.4f}")
plt.scatter(x, y, s=10)
plt.plot(np.sort(x), b0_ols + b1_ols * np.sort(x))
plt.show()
# GD settings (lr, n_iter) are illustrative; unscaled data typically needs a very small lr
def simple_linear_regression_gd(x, y, lr=1e-7, n_iter=1000):
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        y_pred = m * x + c
        dm = (-2 / n) * np.sum(x * (y - y_pred))  # gradient w.r.t. slope
        dc = (-2 / n) * np.sum(y - y_pred)        # gradient w.r.t. intercept
        m -= lr * dm
        c -= lr * dc
    return m, c
m_gd, c_gd = simple_linear_regression_gd(x, y)
print(f"GD → slope: {float(m_gd):.4f}, intercept: {float(c_gd):.4f}")
plt.scatter(x, y, s=10)
plt.plot(np.sort(x), c_gd + m_gd * np.sort(x))
plt.show()
# 2(c): Multiple Linear Regression via Gradient Descent (standardized for stability)
def multiple_lr_gd_stable(X, y, lr=0.01, n_iter=20000):
    X_mean, X_std = X.mean(axis=0), X.std(axis=0)
    y_mean, y_std = y.mean(), y.std()
    Xs = (X - X_mean) / X_std
    ys = (y - y_mean) / y_std
    n, d = Xs.shape
    Xb = np.c_[np.ones((n, 1)), Xs]
    beta = np.zeros(d + 1)
    for _ in range(n_iter):
        yhat = Xb @ beta
        grad = (2 / n) * Xb.T @ (yhat - ys)
        beta -= lr * grad
    # convert the standardized coefficients back to the original scale
    coef = beta[1:] * (y_std / X_std)
    intercept = y_mean + y_std * beta[0] - np.sum(coef * X_mean)
    return intercept, coef
● Q: What is the main difference between OLS (Normal Equation) and Gradient
Descent?
○ A: The Normal Equation is an analytical, direct solution that computes the
optimal parameters in one step. It's fast for small datasets but becomes very slow
if the number of features is large (over 10,000) because it requires inverting a
large matrix. Gradient Descent is an iterative algorithm that gradually converges
to the optimal solution. It works well with very large datasets and doesn't require
matrix inversion, but you need to choose a learning rate and number of iterations.
● Q: Why is feature scaling important for Gradient Descent but not for the Normal
Equation?
○ A: In GD, features with different scales can cause the cost function to become a
very elongated oval. This makes the algorithm take a long, inefficient path to the
minimum. Scaling features (e.g., to a range of 0-1 or with a standard deviation of
1) makes the cost function more circular, allowing GD to converge much faster
and more reliably. The Normal Equation is a direct calculation and is unaffected
by feature scales.
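A small NumPy sketch contrasting the two approaches; it assumes X is a 2-D feature matrix and y a 1-D target vector, and the learning rate and iteration count are illustrative placeholders:
import numpy as np
# Normal Equation: one-shot analytical solution (no scaling needed)
Xb = np.c_[np.ones(len(X)), X]
theta_ne = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
# Gradient Descent: iterative, and converges far more reliably on scaled features
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xsb = np.c_[np.ones(len(Xs)), Xs]
theta_gd = np.zeros(Xsb.shape[1])
for _ in range(5000):
    grad = (2 / len(y)) * Xsb.T @ (Xsb @ theta_gd - y)
    theta_gd -= 0.01 * grad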
Q3: Linear Regression using Scikit-Learn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('Housing.csv')
features = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']
X = df[features]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # assumed 80/20 split
# Simple linear regression on the single feature 'area'
X_train_simple, X_test_simple = X_train[['area']], X_test[['area']]
slr_model = LinearRegression()
slr_model.fit(X_train_simple, y_train)
print(f"Simple LR - Intercept: {slr_model.intercept_}, Coefficient: {slr_model.coef_}")
1. Linearity: The relationship between features and the target is linear. Check with a
residuals vs. predicted values plot (should be random, no pattern).
2. Independence of Errors: The errors (residuals) are independent of each other.
Important for time-series data.
3. Homoscedasticity: The variance of the errors is constant across all levels of the
independent variables. Check the residuals vs. predicted values plot (should be a
constant band, not a cone shape).
4. Normality of Errors: The errors follow a normal distribution. Check with a Q-Q Plot or a
histogram of residuals.
5. No Multicollinearity: The independent variables are not highly correlated with each
other.
○ Variance Inflation Factor (VIF): A metric to quantify multicollinearity. A common
rule of thumb is that a VIF > 5 or 10 indicates a problematic level of correlation.
# checking assumptions
y_pred = slr_model.predict(X_test_simple)  # predictions from the model fitted above
residuals = y_test - y_pred
# 1. linearity
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)
plt.xlabel("actual price")
plt.ylabel("predicted price")
plt.show()
# 3. homoscedasticity
plt.scatter(y_pred, residuals)
plt.xlabel("predicted")
plt.ylabel("residuals")
plt.show()
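A possible extension for assumption #4 (normality of errors), reusing the residuals computed above; scipy's probplot is one way to draw a Q-Q plot:
# 4. normality of errors
import scipy.stats as stats
plt.hist(residuals, bins=30)
plt.xlabel("residuals")
plt.show()
stats.probplot(residuals, dist="norm", plot=plt)  # Q-Q plot against a normal distribution
plt.show()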
Q2: Use the above given dataset, do the VIF analysis and fix the multicollinearity issue if
exist.
#2.Use the above given dataset, do the VIF analysis and fix the multicollinearity issue if exist.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Ensure all values in X are float
x = df[features].astype(float)
y = df['price']
# Step 1: Compute the VIF for each feature
vif_df = pd.DataFrame({
    "Feature": x.columns,
    "VIF": [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
})
print(vif_df)
# Step 2: Remove features with very high VIF (say > 10)
high_vif_features = vif_df[vif_df["VIF"] > 10]["Feature"].tolist()
print("\nFeatures to remove (VIF > 10):", high_vif_features)
x_reduced = x.drop(columns=high_vif_features)
# Step 3: Refit the model on the reduced feature set
x_train, x_test, y_train, y_test = train_test_split(x_reduced, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
● Q: Your VIF analysis shows that 'feature_A' has a VIF of 25. What does this mean
and what should you do?
○ A: A VIF of 25 is very high and indicates severe multicollinearity. It means that
'feature_A' is highly predictable from the other features in the model. The best
course of action is to remove 'feature_A' from the model and then re-calculate the
VIF scores for the remaining features to see if the problem is resolved.
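One way to implement this "remove and re-check" advice is an iterative sketch that keeps dropping the worst feature until every VIF is below the threshold (the helper name and the threshold of 10 are illustrative):
def drop_high_vif(X, threshold=10):
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() <= threshold:
            return X  # all remaining features have an acceptable VIF
        X = X.drop(columns=[vifs.idxmax()])  # drop the single worst feature and recompute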
Q3 & Q4: Ridge Regression and Hyperparameter Tuning
● Regularization: The process of adding a penalty term to the model's cost function to
discourage complexity (i.e., large coefficient values). This helps prevent overfitting and
improves the model's generalization to new data.
● Ridge Regression (L2 Regularization): Adds a penalty equal to the square of the
magnitude of the coefficients. The cost function becomes: MSE + $ \alpha
\sum_{i=1}^{n} \beta_i^2 $. This shrinks the coefficients towards zero but never makes
them exactly zero.
● Hyperparameter: A parameter whose value is set before the learning process begins.
For Ridge, the key hyperparameter is alpha (α), which controls the strength of the
penalty.
● GridSearchCV: A technique to perform hyperparameter tuning by exhaustively
searching through a specified grid of parameter values and finding the combination that
performs best based on cross-validation.
#3a) Load the dataset and use all numerical features (area, bedrooms, bathrooms, stories, parking) to predict price.
X = df[features].values
y = df["price"].values.reshape(-1, 1)
X_b = np.c_[np.ones((X.shape[0], 1)), X]  # add a bias (intercept) column
# Closed-form Ridge solution (normal equation with an L2 penalty)
alpha = 10
I = np.eye(X_b.shape[1])
I[0, 0] = 0  # do not penalize the intercept term
theta_closed = np.linalg.inv(X_b.T @ X_b + alpha * I) @ X_b.T @ y
# Ridge via gradient descent (sketch; in practice scale X first or the updates may diverge)
def ridge_gradient_descent(X, y, lr=0.01, iterations=1000, alpha=10):
    m, n = X.shape
    theta = np.random.randn(n, 1)
    for i in range(iterations):
        gradients = (2 / m) * X.T @ (X @ theta - y) + 2 * alpha * theta
        theta -= lr * gradients
    return theta
theta_gd = ridge_gradient_descent(X_b, y)
y_pred_gd = X_b.dot(theta_gd)
a) Use GridSearchCV to tune alpha for Ridge over the range {0.01, 0.1, 1, 10, 100}.
# GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
ridge = Ridge()
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(x_train, y_train)
best_ridge = grid.best_estimator_
y_pred = best_ridge.predict(x_test)
b) Try at least two or more different values of α in your implementation. Evaluate the models using Mean Squared Error (MSE) and R² score, and record how the choice of α affects the results.
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
# X_train_scaled / X_test_scaled are assumed to be standardized train/test feature splits
for alpha_val in [1000, 10000]:
    lasso = Lasso(alpha=alpha_val, random_state=42)
    lasso.fit(X_train_scaled, y_train)
    y_pred = lasso.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"\nAlpha = {alpha_val}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"R² Score: {r2:.4f}")
    print("Coefficients:", lasso.coef_)
Below is an explanation of the code and a set of potential viva voce (oral examination) questions to help you prepare.
Code Explanation
This Python script uses the scikit-learn library to perform LASSO (Least Absolute
Shrinkage and Selection Operator) Regression. The main goal is to train and evaluate two
separate LASSO models using two different, very high values for the regularization parameter,
alpha.
Step-by-Step Breakdown
1. Initialization
○ alphas = [1000, 10000]: This list holds the two values for the alpha
hyperparameter that will be tested. alpha controls the strength of the
regularization. A higher alpha value imposes a stronger penalty on the model's
coefficients.
○ results_lasso = {}: An empty dictionary is created to store the evaluation
metrics (MSE, R²) and the learned coefficients for each alpha value.
2. Iterating Through Alpha Values
○ for alpha_val in alphas:: This loop runs the entire training and
evaluation process once for each value in the alphas list (first for 1000, then for
10000).
3. Model Creation and Training
○ lasso = Lasso(alpha=alpha_val, random_state=42): An instance of
the LASSO regression model is created.
■ alpha=alpha_val: The regularization strength is set to the current
value from the loop.
■ random_state=42: This ensures that the results are reproducible, so that any random component of the algorithm produces the same result every time it's run.
○ lasso.fit(X_train_scaled, y_train): This is the training step. The
model learns the relationship between the features (X_train_scaled) and the
target variable (y_train). The term _scaled suggests the training features
have been standardized or normalized, which is crucial for LASSO to work
effectively.
4. Prediction and Evaluation
○ y_pred = lasso.predict(X_test_scaled): The trained model is used to
make predictions on new, unseen data (X_test_scaled).
○ mse = mean_squared_error(y_test, y_pred): The Mean Squared
Error is calculated. This metric measures the average of the squares of the
errors—that is, the average squared difference between the actual values
(y_test) and the predicted values (y_pred). Lower is better.
○ r2 = r2_score(y_test, y_pred): The R-squared (R²) score is calculated.
This represents the proportion of the variance in the dependent variable that is
predictable from the independent variables. It ranges from -∞ to 1. Closer to 1 is
better.
5. Storing and Displaying Results
○ results_lasso[alpha_val] = ...: The calculated MSE, R² score, and the
model's learned coefficients (lasso.coef_) are stored in the dictionary, using
the alpha value as the key.
○ print(...): The results for the current alpha are printed to the console for
immediate review. The coefficients are particularly important for LASSO, as they
show which features the model has selected or discarded.
💡 Key Takeaway
The purpose of this code is to demonstrate the effect of a very strong regularization penalty
in LASSO regression. By using large alpha values like 1000 and 10000, the experiment is set
up to force many of the model's coefficients (lasso.coef_) to become exactly zero. This
process is known as feature selection.
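A quick way to see this feature selection in action is to count the zeroed coefficients after fitting (a sketch using the lasso model from the code above):
import numpy as np
n_zero = np.sum(lasso.coef_ == 0)
print(f"{n_zero} of {len(lasso.coef_)} coefficients were shrunk to exactly zero")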
Fundamental Concepts
● Answer: LASSO is a linear regression model that includes a regularization term. Its
main purpose is to prevent overfitting and perform automatic feature selection. It does
this by adding a penalty proportional to the absolute value of the magnitude of the
coefficients. This is also known as L1 Regularization.
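For comparison with the Ridge cost function given earlier, the LASSO cost function can be written as MSE + $ \alpha \sum_{i=1}^{n} |\beta_i| $; it is this absolute-value penalty that allows coefficients to reach exactly zero.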
Q3: What is the difference between LASSO (L1) and Ridge (L2) Regression?
● Answer: Both add a penalty to the cost function, but LASSO penalizes the absolute values of the coefficients (L1) and can shrink some of them to exactly zero, which performs feature selection, while Ridge penalizes the squared coefficients (L2) and shrinks them towards zero without ever making them exactly zero.
Code-Specific Questions
Q4: In your code, you use X_train_scaled. Why is feature scaling important for
LASSO?
● Answer: LASSO's penalty is based on the size of the coefficients. If features are on
different scales (e.g., one feature from 0-1 and another from 100-10,000), the feature
with the larger scale will have a smaller coefficient to begin with, and the penalty will
unfairly affect features with smaller scales. Scaling (like Standardization or
Normalization) ensures all features are on a comparable scale, so the LASSO penalty is
applied fairly to all of them.
Q5: Looking at your alpha values (1000, 10000), what do you expect to see in the
lasso.coef_ output?
● Answer: I expect to see that for alpha=1000, many of the coefficients will be zero. For
alpha=10000, which is an even stronger penalty, I expect even more (or potentially
all) coefficients to be zero. The model is being heavily "punished" for having any
non-zero coefficients.
Q6: How would you expect the R² score to change when you increase alpha from 1000 to
10000?
● Answer: I expect the R² score to decrease (get worse). Because a higher alpha forces
more coefficients to zero, the model becomes simpler. These extremely high alpha
values are likely to make the model too simple (i.e., cause underfitting), to the point
where it can't capture the underlying patterns in the data, leading to a poorer R² score.
Q7: What does it mean if a feature's coefficient in lasso.coef_ is exactly zero?
● Answer: It means the LASSO model has determined that the corresponding feature is not important for predicting the target variable and has effectively removed it from the model. This is the feature selection property of LASSO.
Q8: What would a negative R² score indicate?
● Answer: A negative R² score means the model is performing worse than a simple horizontal line representing the mean of the target variable (y_test). It's a strong indicator that the model is a very poor fit for the data, which can easily happen with an overly aggressive alpha value that creates a model that is too simple.
Q9: How would you choose the best value for alpha in a real-world project?
● Answer: I wouldn't just test two arbitrary values. I would use a technique like
Cross-Validation (e.g., LassoCV in scikit-learn or GridSearchCV) to systematically
test a range of alpha values and find the one that provides the best performance on
unseen data, balancing model complexity and predictive power.
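A minimal sketch of this approach using LassoCV (the alpha grid shown is illustrative):
from sklearn.linear_model import LassoCV
lasso_cv = LassoCV(alphas=[0.01, 0.1, 1, 10, 100, 1000], cv=5, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)
print("Best alpha found by cross-validation:", lasso_cv.alpha_)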
The ElasticNet cost function can be written as: MSE + $ \alpha \left( \lambda_1 \sum_{i=1}^{n} |\beta_i| + \lambda_2 \sum_{i=1}^{n} \beta_i^2 \right) $,
where:
- α controls the strength of regularization,
- λ₁ and λ₂ balance between LASSO and Ridge penalties.
b) Experiment with different values of α and l1_ratio. Compare the model's performance using MSE and R² score, and report how the parameter choices affect the results compared to LASSO.
from sklearn.linear_model import ElasticNet
params = [{'alpha': 1000, 'l1_ratio': 0.5}, {'alpha': 1000, 'l1_ratio': 0.9}]
for p in params:
    elastic_net = ElasticNet(alpha=p['alpha'], l1_ratio=p['l1_ratio'], random_state=42)
    elastic_net.fit(X_train_scaled, y_train)
    y_pred = elastic_net.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    key = f"alpha={p['alpha']}, l1_ratio={p['l1_ratio']}"
    print(f"\n{key}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"R² Score: {r2:.4f}")
    print("Coefficients:", elastic_net.coef_)
This Python script trains and evaluates an ElasticNet regression model using two different
sets of hyperparameters. It then stores and prints the performance metrics (MSE and R²) and
the model's coefficients for each hyperparameter combination.
Code Explanation
1. Initialization:
○ params = [...]: This is a list of dictionaries. Each dictionary specifies a
combination of hyperparameters to be tested. In this case, you're testing two
sets: (alpha=1000, l1_ratio=0.5) and (alpha=1000, l1_ratio=0.9).
○ results_elastic = {}: An empty dictionary is created to store the
evaluation results for each model trained.
2. Iteration and Training Loop:
○ for p in params:: The code iterates through each dictionary (p) in the
params list.
○ elastic_net = ElasticNet(...): Inside the loop, a new ElasticNet
model is created for each set of parameters.
■ alpha=p['alpha']: Sets the overall regularization strength. A higher
alpha value imposes a stronger penalty on the model's coefficients.
■ l1_ratio=p['l1_ratio']: Sets the mix between L1 (Lasso) and L2
(Ridge) penalties. An l1_ratio of 0.5 means the penalty is 50% L1
and 50% L2.
■ random_state=42: Ensures that the results are reproducible every time
the code is run.
○ elastic_net.fit(X_train_scaled, y_train): The model is trained
using the scaled training features (X_train_scaled) and the training target
values (y_train).
3. Prediction and Evaluation:
○ y_pred = elastic_net.predict(X_test_scaled): The trained model is
used to make predictions on the unseen (and scaled) test data.
○ mse = mean_squared_error(...) and r2 = r2_score(...): The code
calculates two key performance metrics: Mean Squared Error (MSE) and the
R-squared (R²) score by comparing the actual test values (y_test) with the
model's predictions (y_pred).
4. Storing and Displaying Results:
○ key = f"...": A descriptive string key (e.g., "alpha=1000, l1_ratio=0.5") is
created for the current hyperparameter set.
○ results_elastic[key] = {...}: The calculated MSE, R2 score, and the
learned model coefficients (elastic_net.coef_) are stored in the
results_elastic dictionary under this key.
○ print(...): The results for the current iteration are printed to the console for
immediate review.
Q2: What is ElasticNet Regression and why would you use it?
A: ElasticNet is a regularized linear regression model that linearly combines the L1 (Lasso) and L2 (Ridge) penalties. It is useful when you want Lasso-style feature selection but have groups of correlated features, where pure Lasso tends to arbitrarily keep one feature from the group and drop the rest.
Q: What do the alpha and l1_ratio hyperparameters control?
A:
● alpha (α): This is the regularization strength parameter. It's a constant that multiplies
the L1 and L2 penalty terms.
○ A high alpha (like 1000 in your code) imposes a strong penalty, which will
shrink the coefficients more aggressively towards zero.
○ An alpha of 0 removes regularization entirely, making it equivalent to a standard
Linear Regression.
● l1_ratio (ρ): This parameter controls the mix between L1 and L2 regularization. Its
value is between 0 and 1.
○ l1_ratio = 1: The model is a Lasso Regression.
○ l1_ratio = 0: The model is a Ridge Regression.
○ 0 < l1_ratio < 1: The model is a combination of both. For example,
l1_ratio=0.5 means the penalty is an equal mix of L1 and L2 norms.
Q5: How do you interpret the MSE and R² scores that are printed?
A:
● Mean Squared Error (MSE): This is the average of the squared differences between the
predicted and actual values. A lower MSE is better, with 0 being a perfect score. Its
units are the square of the target variable's units, which can make it hard to interpret
directly.
● R² Score (Coefficient of Determination): This indicates the proportion of the variance
in the target variable that is explained by the model.
○ An R² of 1 means the model perfectly predicts the data.
○ An R² of 0 means the model performs no better than simply predicting the mean
of the target variable.
○ A negative R² means the model is worse than the baseline mean model.
○ A higher R² is better.
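As a small illustration of that definition, R² can also be computed by hand from the test predictions (a sketch using the y_test and y_pred arrays from the code above):
import numpy as np
ss_res = np.sum((y_test - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot                   # matches sklearn's r2_score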
If a coefficient is zero, it means the model has determined that this feature is not useful for
making predictions. The L1 penalty component of ElasticNet is responsible for this, effectively
performing automatic feature selection.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
X_poly_train = X_train[['area']]
X_poly_test = X_test[['area']]
degrees = [2, 3, 4]
results_poly = {}
for d in degrees:
    poly_reg = Pipeline([
        ('poly', PolynomialFeatures(degree=d, include_bias=False)),
        ('lin_reg', LinearRegression())
    ])
    poly_reg.fit(X_poly_train, y_train)
    y_pred = poly_reg.predict(X_poly_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results_poly[d] = {'MSE': mse, 'R2': r2}
    print(f"\nDegree = {d}")
    print(f"MSE: {mse:.2f}, R²: {r2:.4f}")
Code Explanation
1. Feature Selection:
○ X_poly_train = X_train[['area']]: Instead of using the entire dataset, this line
creates a new DataFrame X_poly_train that contains only the 'area' column
from the original training set. The same is done for the test set. This is done to
build a model based solely on this single feature.
2. Setup for Iteration:
○ degrees = [2, 3, 4]: A list containing the different polynomial degrees to be tested.
The loop will run three times, creating a quadratic (degree 2), cubic (degree 3),
and quartic (degree 4) model.
○ results_poly = {}: An empty dictionary is initialized to store the performance
metrics (MSE and R²) for each degree tested.
3. Model Building and Training Loop:
○ for d in degrees:: The code iterates through each degree in the degrees list.
○ poly_reg = Pipeline([...]): This creates a scikit-learn Pipeline. A pipeline is an
excellent tool that chains multiple data processing steps together. Here, it
combines two steps:
1. ('poly', PolynomialFeatures(...)): The first step generates new polynomial features. For an input feature x (area), PolynomialFeatures(degree=d) will create new features x², x³, …, xᵈ.
■ include_bias=False: This is set to False because the subsequent
LinearRegression step automatically handles the intercept (bias)
term.
2. ('lin_reg', LinearRegression()): The second step is a standard
LinearRegression model. This model takes the newly created polynomial
features (e.g., area and area²) as its input and finds the best linear fit for
them.
○ poly_reg.fit(X_poly_train, y_train): The entire pipeline is trained. The 'area' data is
first transformed into polynomial features, which are then used to train the linear
regression model.
4. Prediction and Evaluation:
○ y_pred = poly_reg.predict(X_poly_test): The trained pipeline is used to make
predictions. The test data (X_poly_test) automatically undergoes the same
PolynomialFeatures transformation before the prediction is made.
○ mse = mean_squared_error(...) and r2 = r2_score(...): The Mean Squared Error
(MSE) and R-squared (R²) score are calculated to evaluate how well the model
performed on the unseen test data.
5. Storing and Displaying Results:
○ results_poly[d] = {...}: The calculated metrics are stored in the results_poly
dictionary, using the degree d as the key.
○ print(...): The degree and its corresponding performance metrics are printed to
the console.
It's used when a simple straight line (standard linear regression) is not sufficient to capture the
underlying trend in the data. It allows for a more flexible, curved fit.
y = β₁(area) + β₂(area²) + β₃(area³) + β₀
Q4: What is the risk of choosing a very high degree (e.g., 20) for the
polynomial?
A: The biggest risk is overfitting. A high-degree polynomial is extremely flexible and can wiggle
its way through every single data point in the training set, resulting in a very high R² score on
that data. However, it will likely fail to generalize to new, unseen data because it has learned the
noise in the training set, not the underlying pattern. This will result in poor performance (high
MSE, low R²) on the test set.
Q5: Looking at the printed results, how would you choose the "best"
degree?
A: You would typically look for the degree that gives the best performance on the test set,
which usually means the highest R² score and the lowest MSE.
However, if two degrees (e.g., 3 and 4) give very similar performance on the test set, it's often
better to choose the simpler model (degree 3) based on the principle of Occam's Razor. A
simpler model is less likely to be overfitted and is more generalizable.
Q6: Could you use Polynomial Regression with more than one feature, for
example, 'area' and 'bedrooms'? What would PolynomialFeatures do then?
A: Yes, absolutely. If you provided two features, 'area' (x₁) and 'bedrooms' (x₂), to PolynomialFeatures(degree=2), it would generate not only the squared terms but also the interaction terms. The output features would be: [x₁, x₂, x₁², x₁x₂, x₂²]. This allows the model to capture how the features interact with each other.
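A small sketch (with made-up values) that prints the generated feature names for these two columns, assuming a recent scikit-learn version:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
X_two = pd.DataFrame({'area': [1000, 2000], 'bedrooms': [2, 3]})
poly2 = PolynomialFeatures(degree=2, include_bias=False)
poly2.fit_transform(X_two)
print(poly2.get_feature_names_out())  # ['area' 'bedrooms' 'area^2' 'area bedrooms' 'bedrooms^2']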
b) Use GridSearchCV to find the best polynomial degree by performing cross-validation. Report the best degree and the corresponding MSE and R² score.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
X_poly = df[['area']]
y_poly = df['price']
pipeline = Pipeline([
    ('poly', PolynomialFeatures(include_bias=False)),
    ('regressor', LinearRegression())
])
# Grid Search over degrees 1-6
param_grid = {'poly__degree': [1, 2, 3, 4, 5, 6]}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_poly, y_poly)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_poly_test)
# Randomized Search over degrees 1-10, sampling 5 candidates
param_dist = {'poly__degree': range(1, 11)}
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=5, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
random_search.fit(X_poly, y_poly)
best_model_random = random_search.best_estimator_
y_pred_best_random = best_model_random.predict(X_poly_test)
This code automates the process of finding the best hyperparameter (in this case, the
polynomial degree) for a regression model using two common techniques: Grid Search and
Randomized Search. It then evaluates the best model found by each technique on a separate
test set.
Code Explanation
1. Common Setup
● Data Selection: The code selects the single feature area as the input (X_poly) and price
as the target (y_poly) from the main dataframe df.
● Pipeline Definition: A Pipeline is created to first generate polynomial features from the
'area' column and then feed them into a LinearRegression model. This pipeline is the
base model that will be tuned.
2. GridSearchCV
● Goal: To exhaustively search through a manually specified list of hyperparameter values
to find the best one.
● param_grid: This dictionary defines the search space. 'poly__degree': [1, 2, 3, 4, 5, 6]
tells Grid Search to test every single integer degree from 1 to 6 for the poly step in the
pipeline. The double underscore __ is the syntax used to access parameters of a step
within a pipeline.
● GridSearchCV(...): This object orchestrates the search.
○ cv=5: It uses 5-fold cross-validation. The data is split into 5 parts. For each
degree, the model is trained 5 times, each time using 4 parts for training and 1
part for validation. This prevents overfitting to a single train-test split and gives a
more robust performance estimate.
○ scoring='neg_mean_squared_error': This is the metric used to judge which
degree is best. Scikit-learn's convention is that higher scores are better. Since we
want to minimize Mean Squared Error (MSE), we use its negative
(neg_mean_squared_error), which we want to maximize.
○ n_jobs=-1: Uses all available CPU cores to speed up the computation.
● .fit(X_poly, y_poly): This command starts the search. It will train a total of 6 (degrees) ×
5 (folds) = 30 models.
● Evaluation: After finding the best degree, the code retrieves the best_estimator_ (the
pipeline retrained on all data with the best degree) and evaluates its final performance
on the unseen test set (X_poly_test).
3. RandomizedSearchCV
● Goal: To efficiently search for the best hyperparameters by randomly sampling from a
given distribution of values.
● param_dist: Defines the distribution to sample from. Here, range(1, 11) means degrees
can be any integer from 1 to 10.
● RandomizedSearchCV(...): This object runs the randomized search.
○ n_iter=5: This is the key parameter. Instead of trying all 10 possible degrees, it
will randomly select and test only 5 of them.
● .fit(X_poly, y_poly): This starts the search. It will train a total of 5 (iterations) × 5 (folds)
= 25 models.
● Evaluation: The process of retrieving the best model and evaluating it on the test set is
identical to Grid Search.
Q2: In this code, why is cv=5 used? What problem does cross-validation
solve?
A: cv=5 specifies 5-fold cross-validation. It's a technique to get a reliable estimate of a
model's performance. It solves the problem of evaluation variance. If you use a single
train-validation split, your performance metric might be high or low simply due to the luck of how
the data was split. By training and validating the model 5 times on 5 different subsets of the data
and averaging the results, cross-validation provides a much more stable and trustworthy
measure of the model's true performance.
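A short sketch of the same idea with cross_val_score, reusing the pipeline and data defined earlier:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_poly, y_poly, cv=5, scoring='neg_mean_squared_error')
print("Per-fold MSE:", -scores)         # scores are negative MSE, so negate them
print("Mean CV MSE :", -scores.mean())  # averaged, more stable performance estimate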
The final evaluation on a completely held-out test set (data the model has never seen during
training or tuning) provides an unbiased estimate of how the model will perform in the real
world on new data. This is the true measure of the model's generalization ability.