Activation Functions in Neural Networks
Recall Linearly/Non-Linearly Separable Data
y = mx + c
Linearly separable: the 2 classes can be separated by a straight line.
Non-linearly separable: the 2 classes can be separated by a curve or a more complex function than a straight line.
Why do we need activations?
• Our real-world data is non-linear and cannot be separated by a straight line. We wish to learn much more complex functions to be able to predict/classify the data we are working with.
Therefore, we need an activation function f(x) to make our neural network more powerful and enable it to learn complex, complicated data and represent non-linear, arbitrary functional mappings between inputs and outputs, in addition to stacking multiple layers (MLPs). An activation function is a non-linear function which takes a linear scalar z1 as its input and maps it to another numerical value y1.
Activation Function
[Diagram: inputs x1, x2, x3 are combined by linear weighted sums into z1 and z2, which are then passed through the non-linear activation f(x) to produce y1 and y2.]
Sigmoid/Logistic Activation
Any input will be squashed to a value between 0 and 1:
f(x) = 1 / (1 + e^(−x))
Ex:
x = 2: f(2) = 1 / (1 + e^(−2)) ≈ 0.88080
x = −1: f(−1) = 1 / (1 + e^(1)) ≈ 0.26894
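A minimal NumPy sketch (assumed, not from the slides) that reproduces these values:

import numpy as np

def sigmoid(x):
    # squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(2))    # ~0.88080
print(sigmoid(-1))   # ~0.26894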
Sigmoid Function Derivative
f′(x) = f(x) · (1 − f(x)), with a maximum value of 0.25 at x = 0
The problem in Sigmoid
The blue curve is the sigmoid; the orange curve is its derivative. This is the cause of vanishing gradients in feedforward networks: the f′ terms all output values << 1, and when we multiply many numbers << 1 together, our gradient gets killed.
Let's even take the best case: the maximum of the sigmoid derivative is only about 0.25. With many layers there will still be a vanishing gradient problem. With only 4 layers of ~0.2-valued derivatives we already have a product of 0.2^4 = 0.0016. BUT, in practice we have very deep architectures! Most of the layers would die!
0.23 × 0.23 × 0.23 × 0.23 ≈ 0.0028
0.1 × 0.2 × 0.15 × 0.18 = 0.00054
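A small illustrative sketch (the per-layer derivative values are assumed, matching the arithmetic above) of how the product of many small f′ terms collapses with depth:

import numpy as np

derivs = [0.23, 0.23, 0.23, 0.23]   # hypothetical per-layer sigmoid derivatives, all << 1
print(np.prod(derivs))              # ~0.0028

derivs = [0.1, 0.2, 0.15, 0.18]
print(np.prod(derivs))              # 0.00054

print(0.25 ** 20)                   # ~9e-13: even the best-case derivative dies over 20 layers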
Gradients Vanish Easily
Tanh Activation
f(x) = (1 − e^(−2x)) / (1 + e^(−2x))
Examples:
f(0) = 0
f(1) = 0.761
f(-0.5) = -0.462
f(1.2) = 0.833
Derivative of Tanh
d/dx tanh(x) = 1 − tanh²(x)
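A hedged NumPy sketch (assumed helper names) of tanh and its derivative, reproducing the examples above:

import numpy as np

def tanh(x):
    # same as np.tanh(x), written via the slide's formula
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))

def tanh_derivative(x):
    return 1 - np.tanh(x) ** 2

print(tanh(1.0))             # ~0.761
print(tanh(-0.5))            # ~-0.462
print(tanh_derivative(0.0))  # 1.0, the derivative's maximum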
ReLU Activation
f(x) = max(0, x)
If the input is negative → the output is zero
If the input is positive → the output stays the same
Ex: f(6) = 6, f(0) = 0, f(−3) = 0, f(2) = 2
Derivative of ReLU
One small problem: when the input is negative, the output is 0 and the gradient will die.
Example: When initializing from the normal distribution
N(0,1), half of the values are negative. Activating with ReLU
means setting half of the values to 0.
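A quick NumPy sketch (illustrative, with an arbitrary seed) showing that ReLU zeroes roughly half of the N(0, 1) activations:

import numpy as np

def relu(x):
    return np.maximum(0, x)

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)   # samples from N(0, 1)
a = relu(z)
print(np.mean(a == 0))            # ~0.5: these units receive zero gradient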
Leaky ReLU and PReLU
f(x) = x if x > 0, else α·x
If α = 0.01 (fixed) → Leaky ReLU
If α is a learnable parameter → PReLU
PReLU solves the dying ReLU problem (ReLU suffers from dead neurons that output 0 and whose gradients become zero).
PReLU derivative: f′(x) = 1 if x > 0, else α
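A minimal sketch (assumed function names) of Leaky ReLU / PReLU and the derivative described above:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # fixed alpha = 0.01 -> Leaky ReLU; learnable alpha -> PReLU
    return np.where(x > 0, x, alpha * x)

def prelu_derivative(x, alpha):
    # 1 for positive inputs, alpha otherwise
    return np.where(x > 0, 1.0, alpha)

print(leaky_relu(np.array([6.0, -3.0, 2.0])))  # ≈ [6, -0.03, 2]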
Exponential Linear Units (ELUs)
f(x) = x if x > 0, else α·(e^x − 1)
For negative values, we have an exponential curve rather than a flat line.
ELU derivative: f′(x) = 1 if x > 0, else α·e^x
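A minimal sketch (assumed names, default α = 1) of ELU and its derivative:

import numpy as np

def elu(x, alpha=1.0):
    # exponential curve for negative inputs instead of ReLU's flat line
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def elu_derivative(x, alpha=1.0):
    # 1 for positive inputs, alpha * exp(x) otherwise
    return np.where(x > 0, 1.0, alpha * np.exp(x))

print(elu(np.array([2.0, -1.0, -5.0])))  # ≈ [2, -0.632, -0.993]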
Comparison
GLU (Gated Linear Units)
Paper: Language Modeling with Gated Convolutional Networks
To activate a layer of dimension d, we make it output double its dimension (2×d) and then split it into 2 halves. The first half acts as the original layer output, and the second half acts as a gating layer. The gating layer is followed by a sigmoid activation function, which squashes each value to the range 0 to 1. The gate values are then element-wise multiplied with the first half.
(X·W1 + b1) ⊗ sigmoid(X·W2 + b2)
[Diagram, example with d = 2: the layer outputs 2×d values; the second half is passed through the sigmoid σ and gates the first half via element-wise multiplication ⊗, giving the activated d-dimensional output.]
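A hedged NumPy sketch of the GLU formula above (shapes and weight names are illustrative; real implementations usually use one projection to 2×d and split it):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W1, b1, W2, b2):
    # (X W1 + b1) gated element-wise by sigmoid(X W2 + b2)
    return (x @ W1 + b1) * sigmoid(x @ W2 + b2)

rng = np.random.default_rng(0)
d_in, d = 8, 2                        # d = 2 as in the example
x = rng.standard_normal((4, d_in))    # a batch of 4 input vectors
W1 = rng.standard_normal((d_in, d))
W2 = rng.standard_normal((d_in, d))
b1 = b2 = np.zeros(d)
print(glu(x, W1, b1, W2, b2).shape)   # (4, 2): the activated output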
Swish
f(x) = x · sigmoid(x): multiplication of the input x with the sigmoid of x.
Swish is a smooth function: it does not abruptly change direction like ReLU does near x = 0. Rather, it smoothly bends from 0 towards negative values and then upwards again.
This also means Swish is non-monotonic: it does not always move in one direction, unlike ReLU and the other activation functions above.
Why is it better than ReLU?
• Sparsity: very negative values are effectively zeroed out.
• For very large values, the outputs do not saturate to a maximum value.
• Small negative values are zeroed out in ReLU; however, those negative values may still be relevant for capturing patterns underlying the data, and Swish preserves them.
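A one-line NumPy sketch (assumed name) of Swish, showing the smooth, non-monotonic dip for small negative inputs:

import numpy as np

def swish(x):
    # x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-5.0, -0.5, 0.0, 5.0])))
# ≈ [-0.0335, -0.1888, 0, 4.9665]: large negatives ~0, small negatives preserved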
Softplus
f(x) = (1/β) · log(1 + e^(βx))
If β = 1: f(x) = log(1 + e^x)
[Plot: SoftPlus compared to ReLU]
Softplus derivative: the sigmoid function!
f′(x) = 1 / (1 + e^(−x))
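A minimal sketch (assumed names) of Softplus and its sigmoid derivative:

import numpy as np

def softplus(x, beta=1.0):
    # a smooth approximation of ReLU
    return np.log(1 + np.exp(beta * x)) / beta

def softplus_derivative(x):
    # the sigmoid function (for beta = 1)
    return 1.0 / (1.0 + np.exp(-x))

print(softplus(np.array([-3.0, 0.0, 3.0])))  # ≈ [0.0486, 0.6931, 3.0486]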
Mish Activation
A modified, gated form of the Softplus activation function:
Mish(x) = x · tanh(softplus(x))
• Unbounded above → no saturation.
• Small negative values are not zeroed out, which allows better gradient flow vs. a hard zero bound as in ReLU.
• Smooth, unlike ReLU, whose derivative is discontinuous at zero; this helps effective optimization and generalization.
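A minimal sketch (assumed name) of Mish built from the pieces above:

import numpy as np

def mish(x):
    # x * tanh(softplus(x)); smooth and non-zero for small negative inputs
    return x * np.tanh(np.log1p(np.exp(x)))

print(mish(np.array([-3.0, 0.0, 3.0])))  # ≈ [-0.146, 0, 2.987]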
Mish Compared to other activation functions
Results of Mish
Mish activation function: 2.15% increase over ReLU!