Classification

So far you've predicted numeric targets. This type of modeling is called regression, hence the "Regressor" part of RandomForestRegressor.
Another common problem you'll see is making a choice between mutually
exclusive outcomes. For example, spam detection is predicting whether an email is
"spam" or "not spam" based on the email's content. This type of modeling is
called classification.
There are two types of classification: binary (choosing between two classes) and
multiclass (choosing between more than two classes). In general there are different
approaches to the two types of classification, but most multiclass models will also
work for binary problems.
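For instance, here is a minimal sketch of a binary problem, using a synthetic dataset in place of real emails (make_classification and its parameters here are illustrative assumptions, not part of the phone-price example below):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Generate a small synthetic two-class dataset standing in for "spam"/"not spam"
X, y = make_classification(n_samples=200, n_features=4, n_classes=2, random_state=0)

# The same classifier used for multiclass problems handles the binary case
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # each prediction is 0 or 1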
It's straightforward to build classification models using what you already know
about scikit-learn. Instead of RandomForestRegressor, you will
use RandomForestClassifier.
As an example of classification with RandomForestClassifier, I'll use a dataset of
phone features to predict a phone's price range. The targets in the data have values:
• 0 (low cost)
• 1 (medium cost)
• 2 (high cost)
• 3 (very high cost)
The features are things like:
• battery_power: Total energy a battery can store at one time, measured in mAh
• blue: Has Bluetooth or not
• clock_speed: Speed at which the microprocessor executes instructions
• dual_sim: Has dual SIM support or not
• fc: Front camera megapixels
• four_g: Has 4G or not
• ...
Here is a quick overview of the data:

In[1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics

data = pd.read_csv('../input/mobile-price-classification/train.csv')
data.head()

In[2]: data.columns
Out[2]:
Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
'touch_screen', 'wifi', 'price_range'],
dtype='object')
We create our features and targets, then split the data using train_test_split. This part looks like what you've already seen.

In[3]:
# Set variables for the targets and features
y = data['price_range']
X = data.drop('price_range', axis=1)

# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=7)

Creating and fitting the model is similar to what you've done before, except you'll
use RandomForestClassifier instead of RandomForestRegressor.

In[4]:
# Create the classifier and fit it to our training data
model = RandomForestClassifier(random_state=7, n_estimators=100)
model.fit(train_X, train_y)

Out[4]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=7, verbose=0,
warm_start=False)
The simplest metric for classification models is accuracy, the fraction of predictions that are correct. Scikit-learn provides metrics.accuracy_score to calculate this.

In[5]:
# Predict classes given the validation features
pred_y = model.predict(val_X)

# Calculate the accuracy as our performance metric
accuracy = metrics.accuracy_score(val_y, pred_y)
print("Accuracy: ", accuracy)
Accuracy: 0.864

Confusion matrix
Our model did pretty well, correctly predicting around 86% of the price ranges in the validation data. It's often useful to look at where the model is failing with a confusion matrix, which shows how the model classified the inputs.

In[6]:
# Calculate the confusion matrix itself
confusion = metrics.confusion_matrix(val_y, pred_y)
print(f"Confusion matrix:\n{confusion}")

# Normalize by the true label counts to get rates
print("\nNormalized confusion matrix:")
for row in confusion:
    print(row / row.sum())

It's a little easier to understand as a figure:
[Figure: heatmap of the normalized confusion matrix]
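A minimal sketch that reproduces such a heatmap with matplotlib (assuming matplotlib is available; the colormap, tick labels, and cell annotations are illustrative choices, not from the original figure):

import matplotlib.pyplot as plt

# Row-normalize the confusion matrix computed above
normalized = confusion / confusion.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
im = ax.imshow(normalized, cmap='Blues')
ax.set_xlabel('Predicted class')
ax.set_ylabel('True class')
ax.set_xticks(range(4))
ax.set_yticks(range(4))
# Write each rate in its cell
for i in range(4):
    for j in range(4):
        ax.text(j, i, f"{normalized[i, j]:.2f}", ha='center', va='center')
fig.colorbar(im)
plt.show()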
The rows of the confusion matrix are the true class and the columns are the predicted class. The diagonal tells us how many of each class the model predicted correctly. The off-diagonals show where the model is making wrong predictions, where it is "confused." For example, looking at the second row and first column, we classified four phones that were actually medium cost as low cost. We see that for classes 0 and 3, the low cost and very high cost phones, our model works really well, above 90% accurate. However, our model is weaker for the medium and high cost phones. Note that incorrect predictions are only between adjacent classes; the model doesn't confuse low cost and very high cost phones.

Class probabilities
Classification models actually calculate a probability distribution over the classes. Calling .predict simply returns the class with the highest probability, which might not be ideal depending on how the decision affects your metrics or downstream measures. To get the probabilities themselves, use the .predict_proba method.
In[7]:
probs = model.predict_proba(val_X)
print(probs)

Each row shows the probabilities the model assigns to the classes for one phone. Often in business problems, the decisions you make lead to different monetary returns. The expected return for a decision based on your classifier is the probability times the monetary return of that decision.
Consider the probabilities [0.05, 0.17, 0.42, 0.36]. Assume the third option would result in $100 of profit while the fourth option would return $150 in profit. Then the expected monetary values are 0.42 × $100 = $42 and 0.36 × $150 = $54. Even though the third option has the highest probability, on average it would be better from a business perspective to choose the fourth option.
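Here is a sketch of that calculation in code. The $100 and $150 profits come from the example above; the zero profits for the first two classes are placeholder assumptions added for illustration:

import numpy as np

# Probabilities from the example above
probs_example = np.array([0.05, 0.17, 0.42, 0.36])
# Profit for choosing each class; zeros for classes 0 and 1 are assumptions
profits = np.array([0, 0, 100, 150])

expected_values = probs_example * profits
print(expected_values)                              # [ 0.  0. 42. 54.]
print("Best decision:", expected_values.argmax())   # class 3, not the most probable class 2

# The same idea applied to every validation row at once:
decisions = (model.predict_proba(val_X) * profits).argmax(axis=1)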
