
Activation Functions in Neural Networks
Recall Linearly/Non-Linearly Separable Data

y = mx + c

Left: the two classes can be separated by a straight line. Right: the two classes can only be separated by a curve or a more complex function than a straight line.
Why do we need activations?
• Our real-world data is non-linear and cannot be separated by a straight line. We wish to learn much more complex functions to be able to predict/classify the data we are working with.

Therefore, we need an activation function f(x) to make our neural network more powerful and enable it to learn complex data and represent non-linear, arbitrary functional mappings between inputs and outputs. Stacking multiple linear layers alone cannot achieve this, since a composition of linear maps is still linear. An activation function is a non-linear function which takes a linear scalar z1 as its input and maps it to another numerical value y1.
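To make this concrete, here is a minimal NumPy sketch (layer sizes and random weights are arbitrary choices for the demo) showing that two stacked linear layers collapse into a single linear layer, while inserting a non-linear activation breaks that collapse:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))          # 5 samples, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two linear layers with no activation ...
two_linear = (x @ W1 + b1) @ W2 + b2
# ... are exactly one linear layer with merged weights.
W, b = W1 @ W2, b1 @ W2 + b2
one_linear = x @ W + b
print(np.allclose(two_linear, one_linear))       # True

# With a non-linearity in between, the collapse no longer holds.
relu = lambda z: np.maximum(0.0, z)
with_activation = relu(x @ W1 + b1) @ W2 + b2
print(np.allclose(with_activation, one_linear))  # False (in general)
```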

Activation Function

[Diagram: the inputs x1, x2, x3 feed two neurons; each neuron computes a linear weighted sum Σ giving z1 or z2, which is passed through the non-linear activation f(x) to produce the output y1 or y2.]
Sigmoid/Logistic Activation

f(x) = 1 / (1 + e^(-x))

Any input is squashed to a value between 0 and 1.

Examples:
x = 2  → f(2) = 1 / (1 + e^(-2)) ≈ 0.88080
x = -1 → f(-1) = 1 / (1 + e^(1)) ≈ 0.26894
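A minimal NumPy sketch of the sigmoid, reproducing the example values above (the helper name is my own):

```python
import numpy as np

def sigmoid(x):
    """Squash any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(2.0))   # ~0.8808
print(sigmoid(-1.0))  # ~0.2689
```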
Sigmoid Function Derivative

f′(x) = f(x) · (1 − f(x))

The problem in Sigmoid

[Plot: the blue curve is the sigmoid, the orange curve is its derivative.] This is the cause of vanishing gradients in feedforward networks: the f′ terms all output values << 1, and when we multiply many numbers << 1 together, our gradient ends up being killed.

Let's even take the best case: the maximum of the sigmoid derivative is only 0.25 (at x = 0). With many layers there will still be a vanishing gradient problem. With only 4 layers of 0.2-valued derivatives we already have a product of 0.2^4 = 0.0016. But in practice we have very deep architectures, so most of the layers would die!

0.23 × 0.23 × 0.23 × 0.23 ≈ 0.0028
0.1 × 0.2 × 0.15 × 0.18 = 0.00054

Gradients Vanish Easily
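A small NumPy sketch of the effect (the depths are arbitrary choices for illustration): even using the peak derivative value 0.25 at every layer, the chained gradient factor shrinks exponentially with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Best case: the derivative is evaluated at x = 0, where it peaks at 0.25.
for depth in (4, 10, 20):
    factor = sigmoid_grad(0.0) ** depth
    print(f"{depth} layers: gradient factor <= {factor:.2e}")
# 4 layers:  3.91e-03
# 10 layers: 9.54e-07
# 20 layers: 9.09e-13
```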


Tanh (Hyperbolic Tangent)

f(x) = (1 − e^(−2x)) / (1 + e^(−2x))

Examples:
f(0) = 0
f(1) = 0.761
f(−0.5) = −0.462
f(1.2) = 0.833

Derivative of Tanh

d/dx tanh(x) = 1 − tanh²(x)
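A NumPy sketch reproducing the tanh values and derivative above (helper names are my own):

```python
import numpy as np

def tanh(x):
    """Equivalent to np.tanh(x), written out from the formula above."""
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

print(tanh(np.array([0.0, 1.0, -0.5, 1.2])))  # approx [ 0.  0.7616 -0.4621  0.8337]
print(tanh_grad(0.0))                         # 1.0 (maximum slope at x = 0)
```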
ReLU (Rectified Linear Unit)

f(x) = max(0, x)

If the input is negative → the output is zero.
If the input is positive → it stays the same.

Examples:
6 → 6
0 → 0
−3 → 0
2 → 2
Derivative of ReLU

f′(x) = 1 if x > 0, 0 if x < 0 (undefined at x = 0; in practice it is taken as 0)

One small problem: when the input is negative, the output is 0 and the gradient will die.

Example: when initializing from the normal distribution N(0, 1), half of the values are negative. Activating with ReLU means setting half of the values to 0.
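A short NumPy sketch of ReLU, its (sub)derivative, and the "half of N(0, 1) gets zeroed" observation (the sample size is arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 1 where the input is positive, 0 elsewhere.
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=100_000)        # N(0, 1) pre-activations
a = relu(z)
print((a == 0).mean())                        # ~0.5: half the activations are zero
print(relu_grad(np.array([-3.0, 0.0, 2.0])))  # [0. 0. 1.]
```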
Leaky ReLU and PReLU

f(x) = x if x > 0, αx otherwise

If α = 0.01 → Leaky ReLU.
In PReLU, α is a learnable parameter.

[Plot: ReLU vs. PReLU, showing the small negative slope α for x < 0.]

PReLU solves the dying ReLU problem (ReLU suffers from dead neurons whose output is 0), where gradients become zero.

PReLU derivative: f′(x) = 1 if x > 0, α otherwise
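A minimal NumPy sketch of the parametric form (here alpha is a plain constant; in a real PReLU it would be a learned parameter):

```python
import numpy as np

def prelu(x, alpha):
    """Leaky ReLU when alpha is fixed (e.g. 0.01); PReLU when alpha is learned."""
    return np.where(x > 0, x, alpha * x)

def prelu_grad(x, alpha):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(prelu(x, 0.01))       # [-0.03  -0.005  0.     2.   ]
print(prelu_grad(x, 0.01))  # [0.01  0.01  0.01  1.  ]
```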
Exponential Linear Units (ELUs)

ELU(x) = x if x > 0, α(e^x − 1) otherwise

For negative values, we have an exponential curve rather than a flat line.

ELU derivative: f′(x) = 1 if x > 0, α·e^x (= ELU(x) + α) otherwise
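A NumPy sketch of ELU and its derivative, assuming the common default α = 1.0:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # For x <= 0 the slope is alpha * exp(x), i.e. elu(x) + alpha.
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(x))       # approx [-0.9502 -0.6321  0.  2.]
print(elu_grad(x))  # approx [ 0.0498  0.3679  1.  1.]
```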
Comparison
GLU (Gated Linear Units)
Paper: Language Modeling with Gated Convolutional Networks

To activate a layer of dimension d, we make it output double its dimension (2×d) and then split the output into two halves. The first half acts as the original layer output, and the second half acts as a gating layer: it is passed through a sigmoid activation, which squashes each value to the range 0 to 1, and the resulting gate values are element-wise multiplied with the first half.

GLU(X) = (XW1 + b1) ⊗ sigmoid(XW2 + b2)
[Diagram: example with d = 2. The layer outputs 2×d values, which are split into two halves of size d; the second half is passed through a sigmoid σ and element-wise multiplied with the first half to give the activated output.]
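A minimal NumPy sketch of a GLU layer following the formula above (the weight shapes and random input are illustrative assumptions, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W1, b1, W2, b2):
    """(X W1 + b1) element-wise multiplied by sigmoid(X W2 + b2)."""
    return (x @ W1 + b1) * sigmoid(x @ W2 + b2)

rng = np.random.default_rng(0)
d_in, d = 3, 2
x = rng.normal(size=(4, d_in))                  # batch of 4
W1, b1 = rng.normal(size=(d_in, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d_in, d)), np.zeros(d)
print(glu(x, W1, b1, W2, b2).shape)             # (4, 2)
```

In practice the two projections are usually computed as one matrix of width 2×d and then split in half, which is equivalent.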
Swish

Swish(x) = x · sigmoid(x): multiplication of the input x with the sigmoid of x.

Swish is a smooth function: it does not abruptly change direction near x = 0 the way ReLU does. Instead, it smoothly dips below 0 for small negative inputs and then turns upwards again. This also makes it non-monotonic, unlike ReLU and the other activation functions above.

Why is it better than ReLU?
• Sparsity: very negative inputs are simply zeroed out.
• For very large values, the outputs do not saturate to a maximum value.
• Small negative values are zeroed out in ReLU, yet they may still be relevant for capturing patterns underlying the data; Swish preserves them.
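A NumPy sketch of Swish in its x · sigmoid(x) form, showing the behaviour described above (the sample inputs are arbitrary):

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

x = np.array([-10.0, -1.0, -0.2, 0.0, 5.0])
print(swish(x))
# approx [-4.5e-04 -2.69e-01 -9.0e-02  0.0  4.97e+00]
# Very negative inputs -> ~0; small negatives stay slightly negative; large x -> ~x.
```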
Softplus

f(x) = (1/β) · log(1 + e^(βx))

[Plot: Softplus compared to ReLU.]

If β = 1:

f(x) = log(1 + e^x)

Softplus Derivative: the sigmoid function!

f′(x) = 1 / (1 + e^(−x))
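A NumPy sketch checking numerically that the derivative of Softplus (with β = 1) is the sigmoid (the finite-difference step size is an arbitrary small value):

```python
import numpy as np

def softplus(x, beta=1.0):
    return np.log1p(np.exp(beta * x)) / beta

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
eps = 1e-6
numeric_grad = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)
print(np.allclose(numeric_grad, sigmoid(x), atol=1e-5))  # True
```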
Mish Activation
A modified, gated form of the Softplus activation function:

Mish(x) = x · tanh(softplus(x)) = x · tanh(log(1 + e^x))

• Unbounded above → no saturation.
• Small negative values are not zeroed out, which allows better gradient flow compared to a hard zero bound as in ReLU.
• Smooth and continuously differentiable, unlike ReLU, whose derivative is discontinuous at zero; this helps effective optimization and generalization.
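A NumPy sketch of Mish built from the softplus and tanh pieces above (helper names are my own):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def mish(x):
    return x * np.tanh(softplus(x))

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(mish(x))
# approx [-0.0336 -0.3034  0.  1.944]
# Negative inputs give small negative outputs (no hard zero); large inputs pass through almost unchanged.
```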
Mish Compared to other activation functions
Results of Mish

Mish activation function: a 2.15% increase over ReLU!

You might also like