Loss Functions
Neural networks use optimization strategies like stochastic gradient descent to
minimize the error of the model. The way we actually compute this error is
by using a Loss Function, which quantifies how well or how poorly the model is
performing.
Loss functions can be classified into two major categories depending upon
the type of learning task we are dealing with — Regression
losses and Classification losses.
In classification, we are trying to predict an output from a finite set of categorical
values, e.g. given a large dataset of images of handwritten digits, categorize
each one as one of the digits 0–9.
Regression, on the other hand, deals with predicting a continuous value: for
example, given the floor area, the number of rooms, and the size of the rooms,
predict the price of the room.
NOTE
n - Number of training examples.
i - ith training example in a data set.
y(i) - Ground truth label for ith training example.
y_hat(i) - Prediction for ith training example.
Regression Losses
1. Mean Square Error/Quadratic Loss/L2 Loss
Mathematical formulation:
MSE = (1/n) * Σ (y(i) - y_hat(i))^2, where the sum runs over the n training examples.
As the name suggests, mean square error is measured as the average of the squared
differences between the predictions and the actual observations.
It is only concerned with the average magnitude of the error, irrespective of its
direction. However, due to the squaring, predictions that are far away from the actual
values are penalized much more heavily than less deviated predictions.
# calculate mean squared error
def mean_squared_error(actual, predicted):
    sum_square_error = 0.0
    for i in range(len(actual)):
        sum_square_error += (actual[i] - predicted[i])**2.0
    mean_square_error = 1.0 / len(actual) * sum_square_error
    return mean_square_error
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375
2. Mean Absolute Error/L1 Loss
Mean absolute error, on the other hand, is measured as the average of sum of
absolute differences between predictions and actual observations. Like MSE, this as
well measures the magnitude of error without considering their direction.
MAE is more robust to outliers since it does not make use of square.
Mathematical formulation:
MAE = (1/n) * Σ |y(i) - y_hat(i)|, where the sum runs over the n training examples.
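A minimal from-scratch sketch, mirroring the mean_squared_error implementation above; the sklearn helper imported below computes the same quantity.

# calculate mean absolute error
def mean_absolute_error(actual, predicted):
    sum_abs_error = 0.0
    for i in range(len(actual)):
        sum_abs_error += abs(actual[i] - predicted[i])
    mean_abs_error = 1.0 / len(actual) * sum_abs_error
    return mean_abs_error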
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5
MAE loss is useful if the training data is corrupted with outliers (i.e. we
erroneously receive unrealistically huge negative/positive values in our
training environment, but not our testing environment).
Deciding which loss function to use
If the outliers represent anomalies that are important for business and should
be detected, then we should use MSE. On the other hand, if we believe that
the outliers just represent corrupted data, then we should choose MAE as
loss.
L1 loss is more robust to outliers, but its derivative is not continuous,
which makes finding the solution less efficient. L2 loss is sensitive to outliers, but
it gives a more stable, closed-form solution (obtained by setting its derivative to
zero). The short example below illustrates how differently the two react to a single
outlier.
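A minimal comparison on made-up data, where one target value is corrupted by an outlier:

import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100.0 is the corrupted outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.9, 4.0])

mse = np.mean((y_true - y_pred) ** 2)      # ~1843.2, dominated by the single outlier
mae = np.mean(np.abs(y_true - y_pred))     # ~19.3, grows only linearly with it

print(mse, mae)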
3. Huber Loss
Mean Square Error (MSE) is useful when we want the model to learn from the
outliers in the dataset; Mean Absolute Error (MAE), on the other hand, is good at
ignoring them.
But in some cases, points that look like outliers should not be ignored, and yet
they should not be given high priority either. This is where Huber loss comes in.
Huber Loss = Combination of both MSE and MAE
Huber loss is quadratic (like MSE) when the error is small and linear (like MAE)
when the error is large:

L_delta(y, y_hat) = 0.5 * (y - y_hat)^2                   if |y - y_hat| <= delta
                    delta * |y - y_hat| - 0.5 * delta^2   otherwise

Here delta is a hyperparameter that defines the boundary between the MSE-like and
MAE-like regions; it is typically tuned iteratively to find the value that fits the
data best.
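A minimal NumPy sketch following the piecewise definition above (the function name and the default delta are just illustrative choices):

import numpy as np

def huber_loss(actual, predicted, delta=1.0):
    # quadratic (MSE-like) where |error| <= delta, linear (MAE-like) elsewhere
    error = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    quadratic = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(np.abs(error) <= delta, quadratic, linear))

# With the values from the earlier examples, all errors fall in the quadratic region:
# huber_loss([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])  ->  0.1875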
Classification Losses
1. Cross Entropy Loss
Cross-entropy loss is often simply referred to as “cross-entropy,”
“logarithmic loss,” “logistic loss,” or “log loss” for short.
It operates on predicted probabilities between 0 and 1 for a classification task.
Cross-entropy calculates the average difference between the predicted and actual
probability distributions.
Each predicted probability is compared to the actual class output value (0 or
1), and a score is calculated that penalizes the probability based on its
distance from the expected value. The penalty is logarithmic, giving a small
score for small differences (0.1 or 0.2) and a huge score for large
differences (0.9 or 1.0).
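For instance, if the true class is 1, predicting a probability of 0.9 costs
-log(0.9) ≈ 0.105, whereas predicting 0.1 costs -log(0.1) ≈ 2.303, so the penalty
grows sharply as the prediction moves away from the true label.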
This is the most common setting for classification problems. Cross-entropy
loss increases as the predicted probability diverges from the actual label.
Consider a 4-class classification task where an image is classified as either a
dog, cat, horse or cheetah.
Let us calculate the probability generated by the first logit after Softmax is
applied. Softmax maps a vector of logits z to probabilities:

p(i) = e^(z(i)) / Σ e^(z(j)), with the sum running over all classes j and e ≈ 2.718.
Softmax therefore converts the logits into probabilities. The purpose of the
cross-entropy is then to take these output probabilities (P) and measure their
distance from the truth values.
Cross-entropy is defined as

CE = - Σ y(c) * log(y_hat(c)), summed over all classes c,

where y(c) is the ground-truth probability for class c (1 for the correct class and
0 otherwise) and y_hat(c) is the predicted probability for class c.
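A rough numeric sketch of the dog/cat/horse/cheetah example; the logits here are made up, since the original figure values are not reproduced.

import numpy as np

# Made-up logits for one image, ordered as [dog, cat, horse, cheetah]
logits = np.array([3.2, 1.3, 0.2, 0.8])
y_true = np.array([1, 0, 0, 0])          # ground truth: dog (one-hot)

# Softmax turns the logits into probabilities that sum to 1
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)           # approximately [0.78, 0.12, 0.04, 0.07]

# Cross-entropy: only the true class contributes to the sum
loss = -np.sum(y_true * np.log(probs))
print(loss)            # approximately 0.25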
Binary Cross-Entropy Loss
For binary classification, we have binary cross-entropy defined as

BCE = -(1/n) * Σ [ y(i) * log(y_hat(i)) + (1 - y(i)) * log(1 - y_hat(i)) ],
with the sum running over the n training examples.

Or it can be written per example as follows:

loss = -log(y_hat(i))        if y(i) = 1
loss = -log(1 - y_hat(i))    if y(i) = 0
Multi-class cross-entropy / categorical cross-entropy
We use multi-class cross-entropy for multi-class classification problems. Let’s
say we need to create a model that predicts the type/class of fruit. We have
three types of fruit (oranges, apples, lemons) in different containers.
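A small sketch of the idea, with hypothetical one-hot labels and predicted probabilities for the three fruit classes (the numbers are purely illustrative):

import numpy as np

# Classes ordered as [orange, apple, lemon]; labels are one-hot encoded
y_true = np.array([[1, 0, 0],                 # orange
                   [0, 1, 0],                 # apple
                   [0, 0, 1]])                # lemon
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6]])

# Multi-class cross-entropy: average of -sum(y * log(y_hat)) over the examples
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
print(loss)   # approximately 0.36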
>>> from sklearn.metrics import log_loss
>>> log_loss(["spam", "ham", "ham", "spam"],
... [[.1, .9], [.9, .1], [.8, .2], [.35, .65]])
0.21616...
from math import log

# calculate binary cross entropy
def binary_cross_entropy(actual, predicted):
    sum_score = 0.0
    for i in range(len(actual)):
        # term for the positive class (y = 1) plus term for the negative class (y = 0);
        # the small 1e-15 offset avoids taking log(0)
        sum_score += actual[i] * log(1e-15 + predicted[i]) \
            + (1 - actual[i]) * log(1e-15 + 1 - predicted[i])
    mean_sum_score = 1.0 / len(actual) * sum_score
    return -mean_sum_score
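Assuming spam is encoded as 1 and ham as 0, and using the predicted spam probabilities from the sklearn snippet above, this reproduces the same result:

>>> binary_cross_entropy([1, 0, 0, 1], [0.9, 0.1, 0.2, 0.65])
0.21616...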