Chapter 4
Model Selection and Training
Machine learning builds models that learn patterns from data and use those patterns to make
predictions. Model selection means choosing which type of machine learning model
(algorithm) to use for a given problem, while training means fitting that model to data.
We’ll cover:
1. What types of models exist.
2. How to choose a model.
3. How to train a model.
4. How to evaluate the model's performance.
4.1. Types of Machine Learning Models
1. Classification Models (For Categorical Data)
Goal: Predict a category or class.
Example: Predicting whether an email is "spam" or "not spam."
Target variable: Categorical (e.g., yes/no, cat/dog, etc.).
2. Regression Models (For Continuous Data)
Goal: Predict a continuous value.
Example: Predicting the price of a house based on features like area, number of rooms,
etc.
Target variable: Continuous (any numeric value in a range, e.g., 1000, 2000.5, 2500).
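One way to see the difference between the two kinds of target is to ask scikit-learn how it would interpret them. The sketch below uses type_of_target() from sklearn.utils.multiclass; the two arrays are hypothetical values made up for illustration, not data from this chapter.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, invented for illustration only
spam_labels = np.array(["spam", "not spam", "spam", "not spam"])  # categories
house_prices = np.array([1000.5, 2000.0, 2500.25])                # real numbers

# Two distinct categories are reported as a binary classification target
print(type_of_target(spam_labels))

# Non-integer numeric values are reported as a continuous (regression) target
print(type_of_target(house_prices))
```

A categorical target points toward a classification model, and a continuous target points toward a regression model.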
4.2. Understanding a Simple Classification Example
Let’s start with a classification problem. We’ll use the Iris dataset, which is a classic dataset in
machine learning. It contains 150 data points, each describing an iris flower with 4 features
(measurements of the flowers). The goal is to predict the species of the flower based on those 4
features.
What’s in the Iris Dataset?
Features (X): Sepal length, sepal width, petal length, and petal width, all measured in
centimeters.
Target (y): The species of the iris flower, which is one of Setosa, Versicolor, or Virginica.
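Before training anything, it can help to confirm these facts directly. The following is a small sketch that inspects the dataset object returned by load_iris(); the variable name class_counts is just an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# 150 flowers (rows), 4 measurements per flower (columns)
print(iris.data.shape)

# Names of the four features and of the three species
print(iris.feature_names)
print(iris.target_names)

# The target stores species as integers 0, 1, 2; count the flowers in each class
class_counts = np.bincount(iris.target)
print(class_counts)
```

The counts show that the three species are equally represented, which is one reason plain accuracy is a reasonable metric for this dataset.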
Why Use Classification?
Since we are trying to predict a category (the species of the iris), this is a classification problem.
Steps to Solve the Classification Problem:
1. Load and Prepare the Data: We load the dataset and separate the data into features (X)
and target (y).
2. Split the Data: We divide the data into a training set and a testing set. The training set
is used to train the model, while the testing set is used to evaluate its performance.
3. Choose a Model: For simplicity, we will use Logistic Regression (a basic but effective
classification algorithm).
4. Train the Model: We train the model on the training data.
5. Evaluate the Model: After the model is trained, we test it on the testing data to check
how well it performs.
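Step 2 can be checked in isolation. The sketch below splits the Iris data 80/20 and prints the resulting sizes. The second split passes stratify=y, an optional argument of train_test_split() that is not used elsewhere in this chapter, to keep the species proportions the same in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain 80/20 split: 120 flowers for training, 30 for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Optional variation: a stratified split keeps 10 flowers of each species in the test set
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(np.bincount(y_test_strat))
```

The model never sees the test rows during training, so the score on them estimates how it would do on new flowers.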
Code Example for Classification (Iris Dataset):
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data    # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Target: species (Setosa, Versicolor, Virginica)
# Step 2: Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Step 3: Initialize and train the Logistic Regression model
# max_iter=200 raises the iteration limit (default 100) to help the solver converge
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)  # Train the model on the training data
# Step 4: Make predictions on the testing set
y_pred = model.predict(X_test)
# Step 5: Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy * 100:.2f}%")
Explanation:
1. load_iris() loads the Iris dataset.
2. train_test_split() splits the data into training and testing sets.
3. LogisticRegression() is the model we’re using for classification. We train it with the
fit() method.
4. predict() is used to make predictions on the test data.
5. accuracy_score() calculates the accuracy of our model, which tells us how often the
model's predictions match the actual values.
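Once trained, the same model can classify a flower it has never seen. The sketch below is a minimal illustration: the measurements in new_flower are hypothetical values chosen for the example, and predict_proba() is used alongside predict() to show how confident the model is in each species.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Hypothetical flower (cm): sepal length, sepal width, petal length, petal width
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# predict() returns the class index; target_names maps it back to a species name
predicted_class = model.predict(new_flower)[0]
print("Predicted species:", iris.target_names[predicted_class])

# predict_proba() returns one probability per species; they sum to 1
probabilities = model.predict_proba(new_flower)[0]
for name, p in zip(iris.target_names, probabilities):
    print(f"{name}: {p:.3f}")
```

Note that the input must be two-dimensional (one row per flower), even when predicting for a single flower.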
4.3. Regression Example: Predicting House Prices
Now let’s talk about regression, where the target variable is continuous (e.g., predicting the
price of a house).
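What’s in the Boston Housing Dataset?
The Boston housing dataset contains information about housing prices in Boston. The goal
is to predict the price of a house based on its features (like the number of rooms, location, etc.).
Note that scikit-learn deprecated load_boston() in version 1.0 and removed it in version 1.2, so
on current versions a substitute such as the California housing dataset
(fetch_california_housing()) is needed.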
Steps to Solve the Regression Problem:
1. Load and Prepare the Data: We load the dataset and separate the data into features (X)
and target (y) (house prices).
2. Split the Data: We divide the data into training and testing sets.
3. Choose a Model: We use Linear Regression, which fits a linear function of the features (a
straight line when there is only one feature) by minimizing the squared error on the training
data.
4. Train the Model: We train the model on the training data.
5. Evaluate the Model: After training, we evaluate the model’s performance using Mean
Squared Error (MSE).
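To make the idea of "finding the best-fitting line" in step 3 concrete, here is a small sketch on synthetic data. The data is generated from the line y = 3x + 2, an arbitrary choice for illustration, so we can check that LinearRegression recovers the slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 3x + 2 (illustrative values)
x = np.arange(10, dtype=float).reshape(-1, 1)  # one feature, ten samples
y = 3.0 * x.ravel() + 2.0

model = LinearRegression()
model.fit(x, y)

# coef_ holds one slope per feature; intercept_ is the constant term
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

With real data the points do not lie exactly on a line, so the fitted line is the one with the smallest total squared error rather than a perfect match.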
Code Example for Regression (House Price Prediction):
# Import necessary libraries
# Note: load_boston() was removed in scikit-learn 1.2, so this example uses
# the California housing dataset (fetch_california_housing) in its place.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Step 1: Load the California housing dataset (downloaded and cached on first use)
housing = fetch_california_housing()
X = housing.data    # Features: e.g., median income, average rooms, location
y = housing.target  # Target: median house value, in units of $100,000
# Step 2: Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Step 3: Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)  # Train the model on the training data
# Step 4: Make predictions on the testing set
y_pred = model.predict(X_test)
# Step 5: Evaluate the model using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error of the Linear Regression model: {mse:.2f}")
Explanation:
1. fetch_california_housing() loads the California housing dataset, used here because
load_boston() was removed in scikit-learn 1.2.
2. train_test_split() splits the data into training and testing sets.
3. LinearRegression() is the model we’re using for regression. We train it with the fit()
method.
4. predict() is used to make predictions on the test data.
5. mean_squared_error() calculates the average squared difference between the model’s
predictions and the actual house prices. A smaller MSE on the test set indicates a better fit.
4.4. Model Evaluation
For Classification (e.g., Logistic Regression):
Accuracy: The percentage of correct predictions.
o Example: If the model correctly classifies 80 out of 100 flowers, the accuracy is
80%.
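Accuracy is simple enough to compute by hand. The sketch below uses two short, made-up label arrays (not model output from this chapter) and checks that the manual calculation agrees with accuracy_score().

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted class labels for ten flowers
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 1, 0, 2])

# Fraction of positions where the prediction equals the true label
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(f"Manual:  {manual_accuracy * 100:.0f}%")
print(f"sklearn: {library_accuracy * 100:.0f}%")
```

Eight of the ten predictions match, so both calculations give 80%.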
For Regression (e.g., Linear Regression):
Mean Squared Error (MSE): Measures the average squared difference between the
predicted values and the actual values.
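o Example: If the predictions for three houses are off by 2, 3, and 4 (in thousands of
dollars), the MSE is (4 + 9 + 16) / 3 ≈ 9.67. Because the errors are squared, MSE is in
squared units; its square root (RMSE, about 3.11 here) is back in thousands of dollars.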
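The MSE calculation can also be done by hand. The sketch below uses three made-up house prices (in thousands of dollars, chosen for illustration) and compares the manual result with mean_squared_error(). It also takes the square root to get RMSE, which is in the same units as the prices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices, in thousands of dollars
y_true = np.array([200.0, 310.0, 150.0])
y_pred = np.array([202.0, 307.0, 154.0])

# Errors are 2, -3, and 4; squaring removes the sign and penalizes large misses
errors = y_pred - y_true
manual_mse = np.mean(errors ** 2)
library_mse = mean_squared_error(y_true, y_pred)

# RMSE converts the squared units back to thousands of dollars
rmse = np.sqrt(manual_mse)

print(f"Manual MSE:  {manual_mse:.2f}")
print(f"sklearn MSE: {library_mse:.2f}")
print(f"RMSE:        {rmse:.2f}")
```

Here the squared errors are 4, 9, and 16, so the MSE is 29 / 3, or about 9.67, and the RMSE is about 3.11.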
4.5. Summary of Model Selection and Training
Here’s a summary of what we covered in this chapter:
1. Classification vs. Regression:
o Classification is for predicting categories (e.g., Iris species).
o Regression is for predicting continuous values (e.g., house prices).
2. Steps to Train a Model:
o Load the dataset and separate features from the target variable.
o Split the data into training and testing sets.
o Choose a model: Select an appropriate model (e.g., Logistic Regression for
classification, Linear Regression for regression).
o Train the model: Use the .fit() method to train the model.
o Evaluate the model: Use metrics like accuracy (for classification) or Mean
Squared Error (for regression).
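The steps above also give a simple way to choose between candidate models: train each one on the same training set and compare their scores on the same test set. The sketch below does this for two classifiers on Iris. The choice of DecisionTreeClassifier as the second candidate is an assumption made for illustration, and a single small test set gives only a rough comparison.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the data and split it once, so every model sees the same rows
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models to compare (illustrative choices)
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Train each candidate and record its accuracy on the held-out test set
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name] * 100:.2f}%")
```

Because every scikit-learn model exposes the same fit() and predict() methods, adding another candidate only requires one more entry in the dictionary.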
Homework / Practice for Chapter 4
1. Classification: Try the Iris dataset with other classification models like K-Nearest
Neighbors or Support Vector Machines.
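2. Regression: Try another regression dataset, such as the California housing dataset
(fetch_california_housing) or the diabetes dataset (load_diabetes), and predict its continuous
target.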
3. Model Evaluation: Calculate the accuracy for classification models and MSE for
regression models on your own datasets.