Recurrent Neural Networks
At the end of this unit, you should be able to understand and comprehend the following syllabus topics:
* Recurrent Neural Networks
* Types of Recurrent Neural Networks
* Feed-Forward Neural Networks vs Recurrent Neural Networks
* Long Short-Term Memory Networks (LSTM)
* Encoder Decoder Architectures
* Recursive Neural Networks
3.1 Recurrent Neural Networks (RNNs)
You have already learnt about Recurrent Neural Networks in "Neural Network (NN) architecture" in Unit 1, Section 1.3. You are good!
3.1.1 Types of Recurrent Neural Networks
* Earlier you had learnt about simple RNN and deep RNN. Let's learn about some more types of RNNs.
1. One-to-One RNN
The simplest type of RNN is One-to-One, which allows a single input and a single output. It has fixed input
and output sizes and acts as a traditional neural network. An example of One-to-One RNN application is
image classification.
The Fig. 3.1.2 illustrates One-to-One RNN.
Fig. 3.1.2
2. One-to-Many RNN
One-to-Many is a type of RNN that gives multiple outputs when given a single input. It takes a fixed input size and gives a sequence of data outputs. Its applications can be found in music generation and image captioning.
The Fig. 3.1.3 illustrates One-to-Many RNN.
3. Many-to-One RNN
Many-to-One RNN is used when a single output is required from multiple input units or a sequence of them. It
takes a sequence of inputs to give a fixed output. Sentiment analysis is a common example of Many-to-One
RNN.
The Fig. 3.1.4 illustrates Many-to-One RNN.
4. Many-to-Many RNN with Equal Unit Size
* Many-to-Many RNN is used to generate a sequence of output data from a sequence of input units. In the case of equal unit size, the number of input units and output units is the same. A common application of Many-to-Many RNN with equal unit size is Name-Entity Recognition.
* The Fig. 3.1.5 illustrates Many-to-Many RNN with equal unit size.
Fig. 3.1.5
5. Many-to-Many RNN with Unequal Unit Size
* Many-to-Many RNN is used to generate a sequence of output data from a sequence of input units. In the case
of unequal unit size, inputs and outputs have different numbers of units. A common application of Many-to-
Many RNN with unequal unit size is machine translation.
* The Fig. 3.1.6 illustrates Many-to-Many RNN with unequal unit size.
Fig. 3.1.6
3.1.2 Feed-Forward Neural Networks vs Recurrent Neural Networks
The Table 3.1.1 provides a quick comparison between feed-forward neural networks and recurrent neural networks.
Table 3.1.1
Comparison Attribute | Feed-forward Neural Networks | Recurrent Neural Networks
Signal flow direction | Forward only | Bidirectional
Delay introduced | No | Yes
Complexity | Low | High
Neuron independence in same layer | Yes | No
Speed | High | Slow
Commonly used for | Pattern recognition, speech recognition, and character recognition | Language translation, speech to text conversion, and robotic control
3.2 Long Short-Term Memory Networks (LSTM)
You previously got a brief about LSTM. Let's dive a little deeper in this section.
Fig. 3.2.1 shows a simple diagram of an LSTM cell.
Fig. 3.2.1
LSTM networks are the most commonly used variation of Recurrent Neural Networks (RNNs). The critical
component of the LSTM is the memory cell and the gates (including the forget gate but also the input gate). The
contents of the memory cell are modulated by the input gates and forget gates. Assuming that both of these gates
are closed, the contents of the memory cell will remain unmodified between one time-step and the next. The
gating structure allows information to be retained across many time-steps, and consequently also allows gradients to flow across many time-steps. This allows the LSTM model to overcome the vanishing gradient problem that occurs with most Recurrent Neural Network models.
The LSTM uses three internal neural networks as respective gates.
These gates control the state memory. The LSTM doesn't require repeated copies of itself, like the basic recurrent cell, so it avoids the problems of vanishing and exploding gradients. You can place this LSTM cell in a layer and train the neural networks inside it using normal backpropagation and optimisation.
3.2.1 LSTM Gates
As you understand, there are three gates used in an LSTM.
1. Forget Gate
* At the forget gate, the input is combined with the previous output to generate a fraction between 0 and 1 that determines how much of the previous state needs to be preserved (or, in other words, how much of the state should be forgotten). This output is then multiplied with the previous state.
* Note : An activation output of 1.0 means "remember everything" and an activation output of 0.0 means "forget everything." From a different perspective, a better name for the forget gate might be the "remember gate"!
2. Input Gate
* The input gate operates on the same signals as the forget gate, but here the objective is to decide which new information is going to enter the state of the LSTM. The output of the input gate (again a fraction between 0 and 1) is multiplied with the output of the tanh block that produces the new values that must be added to the previous state. This gated vector is then added to the previous state to generate the current state.
3. Output Gate
* At the output gate, the input and previous state are gated as before to generate another scaling fraction that is combined with the output of the tanh block that brings in the current state. This output is then given out. The output and state are fed back into the LSTM block.
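The three gate computations described above can be written out directly. The following is a minimal NumPy sketch of a single LSTM time-step, not the exact formulation used in any particular library; the toy sizes and the weight names (Wf, Wi, Wc, Wo) are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time-step: forget, input and output gates modulate the cell state."""
    Wf, bf, Wi, bi, Wc, bc, Wo, bo = params
    z = np.concatenate([h_prev, x_t])        # combine previous output with current input
    f_t = sigmoid(Wf @ z + bf)               # forget gate: fraction of the old state to keep
    i_t = sigmoid(Wi @ z + bi)               # input gate: fraction of the candidate to add
    c_hat = np.tanh(Wc @ z + bc)             # candidate (new) values for the state
    c_t = f_t * c_prev + i_t * c_hat         # update the memory cell
    o_t = sigmoid(Wo @ z + bo)               # output gate: how much of the state to expose
    h_t = o_t * np.tanh(c_t)                 # output fed back into the LSTM block
    return h_t, c_t

# toy sizes: 3 inputs, 4 hidden units (illustrative assumptions)
n_in, n_h = 3, 4
rng = np.random.default_rng(0)
params = []
for _ in range(4):
    params += [rng.standard_normal((n_h, n_h + n_in)) * 0.1, np.zeros(n_h)]
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.standard_normal(n_in), h, c, params)
```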
3.2.2 LSTM Units (Components of LSTM)
The units in the layers of Recurrent Neural Networks are a variation on the classic artificial neuron.
* Each LSTM unit has two types of connections:
1. Connections from the previous time-step (outputs of those units)
2. Connections from the previous layer
* The memory cell in an LSTM network is the central concept that allows the network to maintain state over time. The main body of the LSTM unit is referred to as the LSTM block, as shown in the following Fig. 3.2.3.
Legend : connection with time-lag; branching point; multiplication; sum over all inputs; gate activation function (always sigmoid); input activation function (usually tanh); output activation function (usually tanh).
Fig. 3.2.3
* The components in an LSTM unit are as follows.
* Three gates : Input gate (input modulation gate), Forget gate, Output gate
* Block input
* Memory cell
* Output activation function
* Peephole connections
There are three gate units, which learn to protect the linear unit from misleading signals.
1. The input gate protects the unit from irrelevant input events.
2. The forget gate helps the unit forget previous memory contents.
3. The output gate exposes the contents of the memory cell (or not) at the output of the LSTM unit.
* The output of the LSTM block is recurrently connected back to the block input and to all of the gates of the LSTM block.
* The Input, Forget and Output gates in an LSTM unit have sigmoid activation functions for [0, 1] restriction. The LSTM block input and output activation function is usually a tanh activation function.
3.2.3 Advantages of LSTM
* The specific gated architecture of LSTM is designed to improve all the following shortcomings of the classical RNN.
1. Avoids the exploding and vanishing gradients, specifically with the use of the forget gate at the beginning.
2. Long-term memories can be preserved along with learning new trends in the data. This is achieved through the combination of gating and maintaining state as a separate signal.
3. Prior information on states isn't required and the model is capable of learning from default values.
4. Unlike other deep learning architectures, there are not many hyperparameters that need to be tuned for model optimisation.
3.3 Encoder Decoder Architectures
* Encoder-Decoder models are a family of models which learn to map datapoints from an input domain to an output domain via a two-stage network.
1. The encoder, represented by an encoding function z = f(x), compresses the input into a latent-space representation.
2. The decoder, y = g(z), aims to predict the output from the latent-space representation. The latent representation here essentially refers to a feature (vector) representation, which is able to capture the underlying semantic information of the input that is useful for predicting the output.
* These models are extremely popular in image-to-image translation problems, as well as for sequence-to-sequence models in natural language processing (NLP), where you can translate, say, English to French. The Fig. 3.3.1 illustrates the block-diagram of a simple encoder-decoder model.
Fig. 3.3.1 : Input image → Latent representation → Output map
* These models are usually trained by minimising the reconstruction loss, L(y, ŷ), which measures the differences between the ground-truth output y and the reconstruction ŷ. The output here could be an enhanced version of the image (such as in image de-blurring or super-resolution), or a segmentation map. Autoencoders are a special case of encoder-decoder models in which the input and output are the same. You will learn about autoencoders later in the book.
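To make the two-stage structure z = f(x), y = g(z) concrete, here is a minimal PyTorch sketch of a generic encoder-decoder trained with a reconstruction loss. The fully connected layers and their sizes are illustrative assumptions, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32, out_dim=784):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))      # z = f(x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, out_dim))         # y = g(z)

    def forward(self, x):
        z = self.encoder(x)          # latent-space representation
        return self.decoder(z)       # predicted output / reconstruction

model = EncoderDecoder()
x = torch.rand(16, 784)                      # dummy batch of flattened inputs
y_hat = model(x)
loss = nn.functional.mse_loss(y_hat, x)      # reconstruction loss L(y, y_hat)
loss.backward()
```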
3.3.1 Sequence to Sequence Model (seq2seq)
* A sequence to sequence (seq2seq) model aims to map a fixed-length input with a fixed-length output where the lengths of the input and output may differ. For example, translating "What are you doing today" from English to Chinese has an input of 5 words and an output of 7 symbols.
* Clearly, you cannot use a regular LSTM network to map each word from the English sentence to the Chinese sentence. This is why the sequence to sequence model is used to address problems like this one.
* A sequence to sequence model lies behind numerous systems which you use on a daily basis. For instance, the seq2seq model powers applications like Google Translate, voice-enabled devices and online chatbots. Generally speaking, seq2seq has the following applications.
1. Machine translation : Machine translation is the task of automatically converting source text in one language to text in another language. Given a sequence of text in a source language, there is no one single best translation of that text to another language. This is because of the natural ambiguity and flexibility of human language. This makes the challenge of automatic machine translation difficult, perhaps one of the most difficult in machine learning.
2. Speech recognition : You must have used speech recognition in products such as Amazon Alexa, Apple Siri,
and Google Home. Speech recognition helps you to carry out voice-assisted tasks and the machine typically
“understands" what you need or are asking for.
3. Video captioning : You can automatically generate video captions based on what is happening in the video or what is being said, in various languages. For example, the video could be in English, but you can caption it in Hindi.
4. Question-Answer problems : These could be chatbot applications or voice assistants that respond to your questions with probable answers. You would have experienced it in the customer care section of any e-commerce application.
3.3.2 How the Sequence to Sequence Model Works
* Sequence to Sequence models use the encoder-decoder architecture. For simplicity, consider the following encoder-decoder architecture.
The model consists of 3 parts, as follows.
1. Encoder : The encoder is a stack of several recurrent units (LSTM or GRU cells for better performance) where each accepts a single element of the input sequence, collects information for that element, and propagates it forward. In the question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as $x_i$ where i is the order of that word. The hidden states $h_t$ are computed using the following formula.
    $h_t = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t)$
2. Intermediate (encoder) vector : This is the final hidden state produced from the encoder part of the model, computed using the formula above. This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions. It acts as the initial hidden state of the decoder part of the model.
3. Decoder : Like the encoder, the decoder is also a stack of several recurrent units where each predicts an output $y_t$ at a time step t. Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state. In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as $y_i$ where i is the order of that word. Any hidden state $h_t$ is computed using the following formula.
    $h_t = f(W^{(hh)} h_{t-1})$
As you can see, you are just using the previous hidden state to compute the next one. The output $y_t$ at time step t is computed using the following formula.
    $y_t = \mathrm{softmax}(W^{S} h_t)$
You calculate the outputs using the hidden state at the current time step together with the respective weight $W^{S}$. Softmax is used to create a probability vector which will help you determine the final output (e.g. a word in the question-answering problem).
The power of this model lies in the fact that it can map sequences of different lengths to each other. As you can see, the inputs and outputs are not correlated, and their lengths can differ. This opens a whole new range of problems which can now be solved using such an architecture.
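A small NumPy sketch of the encoder and decoder recurrences and the softmax output described above is given below. The dimensions, the choice of tanh for f, and the dummy word vectors are assumptions made purely to show how the hidden state flows from the encoder into the decoder.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

n_h, n_x, vocab = 8, 5, 20                      # assumed toy sizes
rng = np.random.default_rng(1)
W_hh = rng.standard_normal((n_h, n_h)) * 0.1
W_hx = rng.standard_normal((n_h, n_x)) * 0.1
W_S = rng.standard_normal((vocab, n_h)) * 0.1

# Encoder: h_t = tanh(W_hh h_{t-1} + W_hx x_t) over the input sequence
h = np.zeros(n_h)
for x_t in rng.standard_normal((6, n_x)):       # 6 input "words" as dummy vectors
    h = np.tanh(W_hh @ h + W_hx @ x_t)
context = h                                     # intermediate (context) vector

# Decoder: starts from the context vector, h_t = tanh(W_hh h_{t-1}), y_t = softmax(W_S h_t)
h = context
outputs = []
for _ in range(7):                              # 7 output symbols
    h = np.tanh(W_hh @ h)
    outputs.append(softmax(W_S @ h).argmax())   # pick the most probable symbol
```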
3.4 Recursive Neural Networks
Note : Do not get confused between Recursive Neural Networks and Recurrent Neural Networks (RNN). They have the same abbreviation but are different. To avoid confusion, I will use "Recursive Neural Networks" in full for all my references.
* Recursive Neural Networks, like Recurrent Neural Networks, can deal with variable length input. The primary
difference is that Recursive Neural Networks have the ability to model the hierarchical structures in the training
dataset. It is constructed in such a way that it includes applying the same set of weights recursively over the
different tree-like structures. Recursive neural networks generalise the recurrent neural networks from a chain-like
structure to a tree-like structure.
* For example, images commonly have a scene composed of many objects. Deconstructing scenes is often a problem domain of interest that is nontrivial. The recursive nature of this deconstruction challenges us to not only identify the objects in the scene, but also how the objects relate to form the scene.
The Fig. 3.4.1 illustrates the difference between Recursive Neural Networks and Recurrent Neural Networks.
Fig. 3.4.1 : (a) Recurrent Neural Network and (b) Recursive Neural Network, over the sentence "that movie was cool"
3.4.1 Network Architecture
* A Recursive Neural Network architecture is composed of a shared-weight matrix and a binary tree structure that allows the recursive network to learn varying sequences of words or parts of an image. It is useful as a sentence and scene parser. Recursive Neural Networks use a variation of backpropagation called backpropagation through structure (BPTS). The feed-forward pass happens bottom-up, and backpropagation is top-down. Think of the objective as the top of the tree, whereas the inputs are the bottom.
* The Fig. 3.4.2 illustrates the network architecture of Recursive Neural Networks.
Fig. 3.4.2 : A recursive tree over inputs x1, x2, x3, ..., x5
* Let's take a simple architecture, such as the following, to understand better.
Fig. 3.4.3
* In the simplest architecture, nodes are combined into parents using a weight matrix that is shared across the whole network and a non-linear activation function such as tanh. If $c_1$ and $c_2$ are n-dimensional vector representations of nodes, their parent will also be an n-dimensional vector, calculated as follows.
    $p_{1,2} = \tanh(W [c_1 ; c_2])$
Where $W$ is a learned $n \times 2n$ weight matrix.
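The parent computation $p_{1,2} = \tanh(W[c_1 ; c_2])$ can be sketched in a few lines of NumPy. The vector size, the random leaf vectors standing in for word embeddings, and the omission of a bias term are simplifying assumptions for illustration only.

```python
import numpy as np

n = 4                                          # dimensionality of node vectors (assumed)
rng = np.random.default_rng(0)
W = rng.standard_normal((n, 2 * n)) * 0.1      # shared n x 2n weight matrix

def parent(c1, c2):
    """Combine two child vectors into a parent vector: p = tanh(W [c1; c2])."""
    return np.tanh(W @ np.concatenate([c1, c2]))

# leaves "that", "movie", "was", "cool" as dummy word vectors
leaves = rng.standard_normal((4, n))
p_left = parent(leaves[0], leaves[1])          # ("that" "movie")
p_right = parent(leaves[2], leaves[3])         # ("was" "cool")
root = parent(p_left, p_right)                 # the same weights are reused at every node
```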
* This architecture, with a few improvements, has been used for successfully parsing natural scenes, syntactic parsing
of natural language sentences, and recursive autoencoding and generative modelling of 3D shape structures in the
form of cuboid abstractions.
3.4.2 Types of Recursive Neural Networks
Recursive Neural Networks have a few variations.
1. Recursive Autoencoder : Recursive autoencoders learn how to reconstruct the input. In the case of Natural Language Processing (NLP), it learns how to reconstruct contexts. A semi-supervised recursive autoencoder learns the likelihood of certain labels in each context.
2. Recursive Neural Tensor Network : It computes a supervised objective at each node of the tree. The tensor part of this means that it calculates the gradient a little differently, factoring in more information at each node by taking advantage of another dimension of information using a tensor (a matrix of three or more dimensions).
3.4.3 Applications of Recursive Neural Networks
* Recursive and Recurrent Neural Networks share many of the same use cases. Recursive Neural Networks are traditionally used in NLP because of their ties to binary trees, contexts, and natural-language-based parsers. For example, constituency parsers are able to break up a sentence into a binary tree, segmenting it by the linguistic properties of the sentence. A drawback of Recursive Neural Networks is the constraint that you use a parser that builds the tree structure (typically a constituency parser).
* Recursive Neural Networks can capture both granular structure and higher-level hierarchical structure in datasets such as images or sentences. Applications of recursive neural networks include the following.
1. Image scene decomposition
2. Natural Language Processing (NLP)
* Two specific network configurations you see in practice are recursive autoencoders and recursive neural tensors. You use recursive autoencoders to break up sentences into segments for NLP. You use recursive neural tensors to break up an image into its composing objects and semantically label the objects in the scene.
* Recurrent Neural Networks tend to be faster to train, thus you typically use them in more temporal applications, but they have been shown to work well in NLP-based domains such as sentiment analysis, as well.
Here are a few review questions to help you gauge your understanding of this chapter. Try to attempt these
questions and ensure that you can recall the points mentioned in the chapter.
Recurrent Neural Networks
Q.1 Compare Feed-Forward Neural Networks and Recurrent Neural Networks. (4 Marks)
Q.2 Describe the types of RNNs. (6 Marks)
Long Short-Term Memory Networks (LSTM)
Q.3 With a diagram, explain the general layout of an LSTM cell. (6 Marks)
Q.4 Describe LSTM gates. (4 Marks)
Q.5 Explain the components of an LSTM unit. (6 Marks)
Q.6 Describe the advantages of LSTM. (4 Marks)
Encoder Decoder Architectures
Q.7 Explain the encoder decoder architecture. (4 Marks)
Q.8 Write a short note on the Sequence to Sequence Model. (4 Marks)
Q.9 Where would you use seq2seq and why? (4 Marks)
Q.10 Explain how the seq2seq model works. (6 Marks)
Recursive Neural Networks
Q.11 Explain Recursive Neural Networks. (4 Marks)
Q.12 Explain the network architecture of Recursive Neural Networks. (6 Marks)
Q.13 With a diagram, describe the difference between Recurrent Neural Network and Recursive Neural Network. (4 Marks)
Q.14 Describe the types of Recursive Neural Networks. (4 Marks)
Q.15 Explain the applications of Recursive Neural Networks.

Autoencoders

At the end of this unit, you should be able to understand and comprehend the following syllabus topics:
* Types of Autoencoders
* Undercomplete Autoencoders
* Regularised Autoencoders
* Sparse Autoencoders
* Denoising Autoencoders
* Stochastic Encoders and Decoders
* Contractive Autoencoders
* Applications of Autoencoders
4.1 Autoencoder
Definition : Autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without any supervision.
* These codings typically have a much lower dimensionality (depth of information) than the input data, making autoencoders useful for dimensionality reduction and compression. These codings, or simply, the code, is a compact "summary" or "compression" of the input, also called the latent-space representation. For example, the MP3 format encodes the audio file to a much smaller size than the raw .wav file. Similarly, the JPG image format, when compressing images, preserves the look and feel of the images without compromising a lot on the quality.
* More importantly, autoencoders act as powerful feature detectors, and they can be used for unsupervised pre-training of deep neural networks. Lastly, they are capable of randomly generating new data that looks very similar to the training data; this is called a generative model. For example, you could train an autoencoder on pictures of faces, and it would then be able to generate new faces.

4.1.1 Basic Concept of Efficient Data Representation

* Suppose that I ask you, which of the following number sequences do you find the easiest to memorise?
1. 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
2. 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5
* At first glance, it would seem that the first sequence should be easier, since it is much shorter. However, if you look carefully at the second sequence, you may notice that it follows two simple rules: even numbers are followed by their half, and odd numbers are followed by their triple plus one (this is a famous sequence known as the hailstone sequence). Once you notice this pattern, the second sequence becomes much easier to memorise than the first because you only need to memorise the two rules, the first number, and the length of the sequence. Isn't it better than to memorise the entire series of numbers?
* Now you could argue that you are an expert in memorising very large sequences of numbers and hence you don't care much about the existence of a pattern in the second sequence. You would just learn every number by heart and that would be it. But, generally speaking, it is a fact that it is hard to memorise long sequences and it is useful to recognise patterns. That's the basic concept behind how an autoencoder works: it recognises the "hidden" patterns and then later uses them to reconstruct the input.
4.2 Components and Architecture of Autoencoders
* An autoencoder consists of 3 components.
1. Encoder : It compresses the input into a latent-space representation (the code). The encoder layer encodes the input image as a compressed representation in a reduced dimension. The compressed image is a distorted version of the original image.
2. Code : This is the compressed input (from the encoder) which is fed to the decoder for reconstructing the original input later.
3. Decoder : It decodes the encoded output, in the form of the code, back to the original input. The decoded output is a lossy reconstruction of the original input, and it is reconstructed from the latent-space representation (the code). The goal is to get an output as identical to the input as possible.
* Simply speaking, first the encoder compresses the input and produces the code; the decoder then reconstructs the input using only this code.
The Fig. 4.2.1 depicts these components or layers of the neural network in an autoencoder.
Fig. 4.2.1
* The layer between the encoder and decoder, i.e. the code, is also known as the Bottleneck. This is a well-designed approach to decide which aspects of observed data are relevant information and what aspects can be discarded.
Fig. 4.2.2 : Input Layer, Hidden Layer, Output Layer

4.3 Training an Autoencoder
* Autoencoders are trained the same way as ANNs are trained. You need to set the following four parameters before training an autoencoder.
1. Code size : It is the number of nodes in the middle (bottleneck) layer. A smaller size results in more compression, and it may be difficult to make the size smaller beyond a certain limit and still get satisfactory results.
2. Number of layers : As you understand, the autoencoder can be as deep as you like. Very similar to training an ANN, you need to decide how many layers the autoencoder should have.
3. Number of nodes per layer : You also need to decide the number of nodes per layer of the autoencoder. Typically, the number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder. Also, the decoder is usually symmetric to the encoder in terms of layer structure.
4. Loss (Cost) function : You either use mean squared error (MSE) or cross-entropy as the loss function. If the input values are in the range [0, 1] then you typically use cross-entropy, otherwise you use the mean squared error.
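The four choices above map directly onto code. Below is a minimal PyTorch sketch of setting up and training a small fully connected autoencoder; the particular code size, layer sizes and number of steps are illustrative assumptions rather than recommended values.

```python
import torch
import torch.nn as nn

# code_size is the bottleneck width; the decoder mirrors the encoder layer structure.
code_size = 32
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, code_size))
decoder = nn.Sequential(nn.Linear(code_size, 128), nn.ReLU(),
                        nn.Linear(128, 784), nn.Sigmoid())
autoencoder = nn.Sequential(encoder, decoder)

# Inputs are in [0, 1], so binary cross-entropy is used; otherwise MSE would be the choice.
loss_fn = nn.BCELoss()
optimiser = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

x = torch.rand(64, 784)                 # dummy batch standing in for flattened images
for _ in range(5):                      # a few training steps
    optimiser.zero_grad()
    x_hat = autoencoder(x)
    loss = loss_fn(x_hat, x)            # reconstruction loss against the input itself
    loss.backward()
    optimiser.step()
```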
4.4 Features / Usage / Applications of Autoencoders
* Autoencoders typically have the following features or usages.
1. Image colouring : Autoencoders can be used for converting black and white images into coloured images.
Depending on what the image is and what are the typical colours of the objects in that image, it's possible to
colour the image.
2. Feature extraction : Autoencoders extract only the required features of an image and generate the output by
removing any noise or unnecessary interruption. They can also be used for compression.
3. Dimensionality Reduction : The reconstructed image is similar to the input image but with reduced
dimensions (features). It helps in providing the similar image with a reduced number of pixels.
4. Denoising Image : A denoising autoencoder can be used to reconstruct the image by eliminating the noise
from the input image.
4.5 Types of Autoencoders
* At a high-level, autoencoders are of the following types.
1. Undercomplete
2. Regularised
3. Convolution
4. Sparse
5. Stacked
6. Denoising
7. Variational
8. Contractive
Fig. 4.5.1
4.5.1 Undercomplete Autoencoder
Definition : An autoencoder whose code dimension is less than the input dimension is called undercomplete.
* The simplest architecture for constructing an autoencoder is to constrain the number of nodes present in the hidden layer(s) of the network, limiting the amount of information that can flow through the network. By penalising the network according to the reconstruction error, the model can then learn the most important attributes of the input data and how to best reconstruct the original input from an "encoded" state. Ideally, this encoding learns and describes latent attributes of the input data.
* Copying the input to the output may sound useless, but you are typically not interested in the output of the decoder. Instead, you hope that training the autoencoder to perform the input copying task will result in the hidden layer h taking on useful properties (obtaining useful features). One way to obtain useful features from the autoencoder is to constrain h to have a smaller dimension than the input x.
* The Fig. 4.5.2 illustrates an undercomplete autoencoder.
Fig. 4.5.2 : Input layer, hidden layer and output layer of an undercomplete autoencoder
4.5.2 Regularised Autoencoder
* Ideally, you could train any architecture of autoencoder successfully, choosing the code dimension and the capacity of the encoder and decoder based on the complexity of the distribution to be modelled. Regularised autoencoders provide the ability to do so. Rather than limiting the model capacity by keeping the encoder and decoder shallow and the code size small, regularised autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularised autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function. You will learn about a few regularised autoencoders such as sparse autoencoders and denoising autoencoders subsequently.
4.5.3 Convolution Autoencoders (CAE)
* As you know, a signal can be formed as a sum of other signals. Convolutional Autoencoders use the convolution operator to encode the input into a set of simple signals and then try to reconstruct the input from those signals using the convolution network.
* It is typically used for
1. Image reconstruction
2. Image colourization
3. Latent space clustering
4. Generating higher resolution images
Fig. 4.5.3
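As a rough illustration, the following PyTorch sketch builds a small convolutional autoencoder for 28 x 28 grayscale images, using convolutions in the encoder and transposed convolutions in the decoder. The channel counts and image size are assumptions, not a recommended design.

```python
import torch
import torch.nn as nn

conv_autoencoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7 (the code)
    nn.ReLU(),
    nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # 7x7 -> 14x14
    nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1),   # 14x14 -> 28x28
    nn.Sigmoid(),
)

x = torch.rand(8, 1, 28, 28)                 # dummy batch of 28x28 grayscale images
x_hat = conv_autoencoder(x)
loss = nn.functional.mse_loss(x_hat, x)      # reconstruction loss
loss.backward()
```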
4.5.4 Sparse Autoencoders (SAE)
One of the constraints that often leads to good feature extraction is sparsity. Using sparsity, you can push the
autoencoder to reduce the number of active neurons in the coding layer. For example, it may be pushed to have
on average only 5% significantly active neurons in the coding layer. This forces the autoencoder to represent each
input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends
up representing a useful feature (if you could speak only a few words per month, you would probably try to make
them worth listening to).
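One common way to implement this constraint is to add a penalty that pushes the average activation of each coding unit towards a small target (5% here). The PyTorch sketch below uses a KL-style sparsity penalty; the penalty weight, target sparsity and layer sizes are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())
sparsity_weight, target_sparsity = 1e-3, 0.05      # aim for ~5% average activation

x = torch.rand(64, 784)
codes = encoder(x)
x_hat = decoder(codes)
recon_loss = nn.functional.mse_loss(x_hat, x)

# KL-style penalty pushing the mean activation of each coding unit towards 5%
rho_hat = codes.mean(dim=0).clamp(1e-6, 1 - 1e-6)
rho = torch.full_like(rho_hat, target_sparsity)
kl = (rho * (rho / rho_hat).log() + (1 - rho) * ((1 - rho) / (1 - rho_hat)).log()).sum()

loss = recon_loss + sparsity_weight * kl
loss.backward()
```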
4.5.5 Stacked Autoencoders
* Adding more layers helps the autoencoder learn more complex codings. However, you must be careful not to make the autoencoder too powerful. Imagine an encoder so powerful that it just learns to map each input to a single arbitrary number (and the decoder learns the reverse mapping). Obviously, such an autoencoder will reconstruct the training data perfectly, but it will not have actually learned any useful data representation in the process (and it is unlikely to generalise well to new instances). It is very similar to you memorising the entire series of numbers from the earlier example.
* The architecture of a stacked autoencoder is typically symmetrical with regards to the central hidden layer (the coding layer). Simply put, it looks like a sandwich. For example, an autoencoder may have 784 inputs, followed by a hidden layer with 300 neurons, then a central hidden layer of 150 neurons, then another hidden layer with 300 neurons, and an output layer with 784 neurons. Such a stacked autoencoder is illustrated as follows.
Fig. 4.5.5 : 784 units (inputs) → 300 units → 150 units → 300 units → 784 units (reconstructions)
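The 784-300-150-300-784 sandwich described above can be written directly as a stack of layers. This is only a structural sketch in PyTorch; the activation choices are assumptions.

```python
import torch.nn as nn

# Symmetric "sandwich" architecture from the example: 784 -> 300 -> 150 -> 300 -> 784
stacked_autoencoder = nn.Sequential(
    nn.Linear(784, 300), nn.ReLU(),     # first hidden layer (encoder)
    nn.Linear(300, 150), nn.ReLU(),     # central hidden layer (the codings)
    nn.Linear(150, 300), nn.ReLU(),     # mirrored hidden layer (decoder)
    nn.Linear(300, 784), nn.Sigmoid(),  # reconstruction of the 784 inputs
)
```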
4.5.6 Denoising Autoencoders (DAE)
* You can force the autoencoder to learn useful features by adding noise to its inputs and then training it to recover the original noise-free inputs. This prevents the autoencoder from trivially copying its inputs to its outputs, and so it ends up having to find patterns in the data. So, you are forcing the autoencoder to subtract the noise and produce the underlying meaningful data. Such an autoencoder is called a denoising autoencoder.
* The Fig. 4.5.6 illustrates a denoising autoencoder.
Fig. 4.5.6 : Original image (with noise added) → Encoder → Code → Decoder → Output
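A denoising autoencoder differs from a plain one only in what it is fed and what it is compared against. The PyTorch sketch below adds Gaussian noise to the inputs and computes the loss against the clean inputs; the noise level and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)

x = torch.rand(64, 784)                       # clean inputs
noisy_x = x + 0.3 * torch.randn_like(x)       # Gaussian noise added to the inputs only
x_hat = autoencoder(noisy_x)

# The target is the original, noise-free input, forcing the network to remove the noise.
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()
```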
4.5.7 Variational Autoencoder (VAE)
* Variational autoencoders are more modern and complex. They are quite different from all the autoencoders that
you have learnt so far in the following respect.
1. They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after
training (as opposed to denoising autoencoders, which use randomness only during training).
2. Most importantly, they are generative autoencoders, meaning that they can generate new instances that look
like they were sampled from the training set.
* Let's understand how it works with the help of the Fig. 4.5.7.
Fig. 4.5.7
* You see the typical structure of an autoencoder, but this time with a twist. Instead of directly producing a coding for a given input, the encoder produces a mean coding μ and a standard deviation σ. The actual coding is then sampled randomly from a Gaussian distribution with mean μ and standard deviation σ. After that, the decoder just decodes the sampled coding normally. The right part of the diagram shows a training instance going through this autoencoder. First, the encoder produces μ and σ, then a coding is sampled randomly (notice that it is not exactly located at μ), and finally this coding is decoded, and the final output resembles the training instance.
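The mean and standard-deviation trick can be sketched in PyTorch as follows: the encoder outputs μ and log σ, and the coding is sampled before decoding. This is a simplified illustration (the KL regularisation term that is normally trained alongside the reconstruction loss is omitted), and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Encoder producing a mean and a standard deviation instead of one fixed coding."""
    def __init__(self, in_dim=784, latent_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)         # mean coding (mu)
        self.log_sigma = nn.Linear(128, latent_dim)  # log of the standard deviation (sigma)

    def forward(self, x):
        h = self.hidden(x)
        mu, log_sigma = self.mu(h), self.log_sigma(h)
        # Sample the actual coding from a Gaussian with mean mu and std sigma
        z = mu + log_sigma.exp() * torch.randn_like(mu)
        return z, mu, log_sigma

encoder = VAEEncoder()
decoder = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
z, mu, log_sigma = encoder(torch.rand(16, 784))
x_hat = decoder(z)                                   # the decoder decodes the sampled coding
```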
4.5.8 Stochastic Autoencoder
* In a stochastic autoencoder, both the encoder and the decoder are not simple functions but instead involve some noise injection. The output can be seen as sampled from a distribution, $p_{encoder}(h \mid x)$ for the encoder and $p_{decoder}(x \mid h)$ for the decoder, where h is the code (hidden layer) and x is the input (as well as the target for the decoder). The output variables are treated as being conditionally independent given h so that this probability distribution is inexpensive to evaluate, but some techniques, such as mixture density outputs, allow flexible modelling of outputs with correlations.
Fig. 4.5.8 : $p_{encoder}(h \mid x)$ and $p_{decoder}(x \mid h)$
* So, mathematically, you can say that
    Stochastic encoder : $p_{encoder}(h \mid x) = p_{model}(h \mid x)$
    Stochastic decoder : $p_{decoder}(x \mid h) = p_{model}(x \mid h)$
4.5.9 Contractive Autoencoder
* The objective of a contractive autoencoder is to have a robust learned representation which is less sensitive to small variations in the data. Robustness of the representation for the data is achieved by applying a penalty term to the loss function. The contractive autoencoder is another regularisation technique just like sparse and denoising autoencoders. The contractive autoencoder is a better choice than the denoising autoencoder to learn useful feature extraction. The model learns an encoding in which similar inputs have similar encodings. Hence, you are forcing the model to learn how to contract a neighbourhood of inputs into a smaller neighbourhood of outputs.
* You can explicitly train your model by requiring that the derivatives of the hidden layer activations are small with respect to the input. In other words, for small changes to the input, you should still maintain a very similar encoded state. This is quite similar to a denoising autoencoder in the sense that these small changes to the input are essentially considered noise and that you would like your model to be robust against noise. Simply speaking, denoising autoencoders make the reconstruction function (decoder) resist small but finite-sized changes in the input, while contractive autoencoders make the feature extraction function (encoder) resist small changes in the input.
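For a single sigmoid encoding layer, the contractive penalty (the squared Frobenius norm of the Jacobian of the code with respect to the input) has a simple closed form, sketched below in PyTorch. The penalty weight and layer sizes are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
lam = 1e-4                                          # weight of the contractive penalty

x = torch.rand(32, 784)
h = encoder(x)
x_hat = decoder(h)
recon_loss = nn.functional.mse_loss(x_hat, x)

# For h = sigmoid(Wx + b), the squared Frobenius norm of the Jacobian dh/dx
# has the closed form sum_j (h_j (1 - h_j))^2 * ||W_j||^2.
W = encoder[0].weight                               # shape (64, 784)
dh = h * (1 - h)                                    # derivative of the sigmoid
contractive_penalty = ((dh ** 2) @ (W ** 2).sum(dim=1)).sum()

loss = recon_loss + lam * contractive_penalty       # penalised reconstruction loss
loss.backward()
```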
* The Fig. 4.5.9 illustrates a contractive autoencoder.
Fig. 4.5.9 : Train observations, the learned reconstruction function, and the linear identity function (perfect reconstruction); similar inputs are contracted to a constant output within a neighbourhood, based on what the model observed during training.

Representation Learning
At the end of this unit, you should be able to understand and comprehend the following syllabus topics:
* Representation Learning
* Greedy Layer-wise Pre-training
* Transfer Learning and Domain Adaption
* Distributed Representation
* Variants of CNN : DenseNet
5.1 Representation Learning (Feature Learning)
Definition : Feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data.
* It replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task. Before you learn more about representation learning, let's quickly learn (recap) about feature engineering so that you understand why representation learning is extremely useful.
5.1.1 Feature and Feature Engineering
Before diving into feature engineering, let's take a moment and do a recap of what you have learnt so far. Let's take a look at the overall machine learning pipeline.
Data
* The raw data, or just data, is a collection of observations of real-world phenomena. For instance, stock market data might involve observations of daily stock prices, announcements of earnings by individual companies, and even opinion articles from pundits. Sports data could have information on matches, the environment in which those matches were played, individual performances, and several other observations. Similarly, personal biometric data can include measurements of your minute-by-minute heart rate, blood sugar level, blood pressure, oxygen level, etc. You can come up with endless examples of data across different domains.
* Each piece of data provides a small window into a limited aspect of reality. The collection of all of these observations gives you a picture of the whole. But the picture is messy because it is composed of a thousand little pieces, and there's always measurement noise and missing pieces.
* Why do you collect data? I am sure you would say that there are several questions that data can help you answer (or predict). Some of the popular questions are :
* How likely is it that a customer buying product A will also buy product B?
* Which team is likely to win?
* How will the weather be next month?
* What food should you eat to get healthier?
* What is the risk of getting diabetes based on your biometric data?
The path from data to answers is full of false starts and dead ends.
Fig. 5.1.1
* What starts out as a promising approach may not work in reality. What was originally just a hunch may end up leading to the best solution. Workflows with data are frequently multistage and iterative processes. For instance, stock prices are observed at the exchange, aggregated by an intermediary like Thomson Reuters, stored in a database, bought by a company, converted into a Hive store on a Hadoop cluster, pulled out of the store by a script, subsampled, massaged, and cleaned by another script, dumped to a file, and converted to a format that you can try out in your favourite modelling library in R, Python, or Scala. The predictions are then dumped back out to a CSV file and parsed by an evaluator, and the model is iterated multiple times, rewritten in C++ or Java by your production team, and run on all of the data before the final predictions are pumped out to another database.
* However, if you disregard the mess of tools and systems for a moment, you might see that the process involves
two mathematical entities that are at the centre of machine learning - models and features.
Models
* A mathematical model of data describes the relationship between different aspects of the data. For instance, a model that predicts stock prices might be a formula that maps a company's earning history, past stock prices, and other observations to the predicted stock price.
* Redundant data contains multiple aspects that convey exactly the same information. For instance, the day of week may be present as a categorical variable with values of "Monday," "Tuesday," ... "Sunday," and again included as an integer value between 0 and 6. If this day-of-week information is not present for some data points, then you have got missing data on your hands.
* But, for most machine learning mathematical models, features are required to be numeric so that they can be used in various computations.
* So, let's redefine features as,
Definition : A feature is a numeric representation of raw data.
* As you know, the features in a data set are also called its dimensions. So a data set having n features is called an n-dimensional data set. For example, consider the following data set.
Gender | Marks
Girl | 65
Girl | 46
Boy | 56
Boy | 44
Boy | 53
Boy | 41
Girl | 42
Boy | 84
Boy | 41
Girl | 42
Girl | 40
* How many features or dimensions does it have? Two, right? Yes, this is a two-dimensional data set, or you could also say that this data set has 2 features. I know you are saying out loud that "hey look, the gender field is not numeric". I understand, that is where feature engineering would come in, where you would understand how you could convert this raw data set into something more meaningful and computationally more appropriate. For example, you could assign a value of "0" for boys and a value of "1" for girls. Now that is numeric, isn't it?
5.1.2 Feature Engineering
* There are many ways to turn raw data into numeric measurements (I just showed you one earlier), which is why features can end up looking like a lot of things. Naturally, features must derive from the type of data that is available. Features are also tied to the model. Some models are more appropriate for some types of features, and vice versa. The right features are relevant to the task at hand and should be easy for the model to ingest.
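As a tiny illustration of turning the earlier gender/marks table into purely numeric features, the pandas snippet below maps the categorical column to 0/1. The column names and the handful of rows are just the toy example from above.

```python
import pandas as pd

# A few rows of the gender/marks table, with the categorical field mapped to a number.
df = pd.DataFrame({"Gender": ["Girl", "Girl", "Boy", "Boy", "Girl"],
                   "Marks": [65, 46, 56, 44, 42]})
df["Gender"] = df["Gender"].map({"Boy": 0, "Girl": 1})   # "0" for boys, "1" for girls
print(df.head())
```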
Definition : Feature engineering is the process of formulating the most appropriate features given the data, the model, and the task.
The Fig. 5.1.2 depicts where feature engineering sits in the machine learning pipeline.
Feature Engineering
Fig. 5.1.2
Features and models sit between raw data and the desired insights. In a machine learning workflow, you pick not
only the model, but also the features. This is a double-jointed lever, and the choice of one affects the other. Good
features make the subsequent modelling step easy and the resulting model more capable of completing the
desired task. Bad features may require a much more complicated model to achieve the same level of performance.
The number of features is also important. If there are not enough informative features, then the model will be
unable to perform the ultimate task. If there are too many features, or if most of them are irrelevant, then the
model will be more expensive and trickier to train. Something might go wrong in the training process that impacts
the model's performance.
* Feature engineering typically includes feature creation, feature transformation, feature extraction, and feature selection.
Fig. 5.1.3
* Feature creation identifies the features in the dataset that are relevant to the problem at hand.
* Feature transformation manages replacing missing features or features that are not valid.
* Feature extraction is the process of creating new features from existing features, typically with the goal of reducing the dimensionality of the features.
* Feature selection is the filtering of irrelevant or redundant features from your dataset. This is usually done by observing variance or correlation thresholds to determine which features to remove.
* At a high-level, the feature engineering process looks like shown in Fig. 5.1.4.
Fig. 5.1.4
Data Engineering vs Feature Engineering
* Often raw data engineering (data pre-processing) is confused with feature engineering. Data engineering is the process of converting raw data into prepared data. Feature engineering then tunes the prepared data to create the features expected by the machine learning model. These terms have specific meanings as outlined here.
* Raw data (or just data) : This refers to the data in its source form, without any prior preparation for machine learning. Note that in this context the data might be in its raw form (in a data lake) or in a transformed form (in a data warehouse). Transformed data in a data warehouse might have been converted from its original raw form to be used for analytics, but in this context, it means that the data was not prepared specifically for your machine learning task. In addition, data sent from streaming systems that eventually call machine learning models for predictions is considered raw data.
* Prepared data : This refers to the dataset in the form ready for your machine learning task. Data sources have been parsed, joined, and put into a tabular form. Data has been aggregated and summarized to the right granularity : for example, each row in the dataset represents a unique customer, and each column represents summary information for the customer, like the total spent in the last six weeks. Irrelevant columns have been dropped, and invalid records have been filtered out.
* Engineered features : This refers to the dataset with the tuned features expected by the model, that is, created by performing certain machine learning specific operations on the columns in the prepared dataset, and creating new features for your model during training and prediction. Some of the common examples are scaling numerical columns to a value between 0 and 1, clipping values, and one-hot-encoding categorical features.
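A minimal sketch of these two common operations (scaling a numerical column to [0, 1] and one-hot-encoding a categorical column) is shown below using pandas. The column names total_spent and plan are hypothetical, chosen only to mirror the prepared-data example.

```python
import pandas as pd

prepared = pd.DataFrame({"total_spent": [120.0, 45.5, 300.0],
                         "plan": ["basic", "premium", "basic"]})

# One-hot-encode the categorical column and scale the numerical column to [0, 1].
spent = prepared["total_spent"]
engineered = pd.get_dummies(prepared, columns=["plan"])
engineered["total_spent"] = (spent - spent.min()) / (spent.max() - spent.min())
print(engineered)
```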
In practice, data from the same source is often at different stages of readiness. For example, a field from a table in
your data warehouse could be used directly as an engineered feature. At the same time, another field in the same
table might need to go through transformations before becoming an engineered feature. Similarly, data
‘engineering and feature engineering operations might be combined in the same data pre-processing step.
* The Fig. 5.1.5 highlights the placement of data engineering and feature engineering tasks.
Feature
Engineering
Fig. 5.1.5
5.2 Introduction to Representation Learning
* Now that you understand how features are used to develop machine learning models and why feature engineering
is required, let’s resume our discussion on representation learning.
What is a Representation?
* Assume that you come across a dish by the name "chicken tortilla soup”. As soon as you read the dish name, in
your mind, you had some sort of representation about what the words chicken tortilla soup could mean, even though there was no mention of the soup's particular taste, texture, toppings, serving size, appearance, temperature, ingredients, molecular composition, or any other specifics. The 21 letters and spaces making up "chicken tortilla soup" conveyed sufficient information for you to get a fair idea about what it could look and possibly taste like.
* Obviously, those 21 characters are not the soup itself that you can consume and tell what it tastes like. They are a representation of the abstract concept of a chicken tortilla soup. The representation itself consists only of meaningless symbols like "c" and "h" from the Latin alphabet, organised according to the conventions of written English. The concept is an abstract one because it does not contain information specific to concrete examples of the soup : taste, texture, toppings, etc.
Fig. 5.2.1
* The representation makes sense to you only because it is coherent with the patterns of organisation of other representations constructed from the same symbols and conventions. Combinations of the same symbols that are not coherent with the conventions of written English, such as "ickchenk orttaall pous" and "nekcihc allitrot puos", are meaningless in English; they don't represent anything (even though they have the exact same letters that chicken tortilla soup has).
* If you search online for chicken tortilla soup recipes, those are representations too, made up of the same meaningless symbols from the Latin alphabet, organised into complicated patterns of paragraphs, sentences, phrases, and words according to the conventions of the language.
* You could use a different set of conventions for constructing representations (e.g. the conventions of French, German, or Spanish) or even a different set of symbols (e.g. logograms from Chinese Hanzi, or dots and dashes from Morse Code), but the representation of a chicken tortilla soup must be coherent with the patterns of organisation of all other representations constructed from the same symbols and conventions. The entire structure of written language, as it were, rests on those patterns of organisation. Indeed, those patterns must be coherent with each other for us to understand written language.
* This is true not only for all written languages you read with your eyes (or fingertips, in the case of Braille) but also for all other kinds of representations you perceive through your senses. For example, the waves of changing air pressure that reach your ears when someone nearby says "chicken tortilla soup" must be coherent with the patterns of organisation of other sounds you hear; otherwise, you would not understand the spoken language. Similarly, the many millions of photons that hit your retinas every second you look at a chicken tortilla soup must be coherent with the patterns of organisation of the many other millions of photons that reach your eyes after bouncing off other visible objects; otherwise, you would not understand what you see.
* It must be true for everything you think too. The neuron activations in your brain that represent the thought "chicken tortilla soup" must be coherent with the patterns of organisation of other neuron activations in your brain; otherwise, you would not be able to think of the soup. Hence you should be able to use other symbols instead of neuron activations to represent your thoughts. The notion might seem far-fetched, but in fact that is precisely what you do with the symbols of written language. You represent thoughts with them!
‘What is Representation Learning?
* When you say "representation learning" (deep or shallow), it means machine learning in which the goal is to learn to transform data from its original representation into a new representation that retains information (features) essential to objects that are of interest to you, while discarding other information (features). The transformation is analogous to the manner in which you transform the neuron activations in your brain, representing a nearby restaurant's chicken tortilla soup, into a new representation consisting of 21 characters in written English. The new representation in written English retains the essential features of chicken tortilla soup (the object of interest) while discarding other information, including details specific to the nearby restaurant's particular version of the soup.
* Representations in a computer consist of sequences of binary digits, or bits, typically organised into floating-point numbers. The use of sequences of bits, digital ones and zeros, as symbols for the construction of representations inside computers is not important for your purposes. What is important is that the new representations, each one consisting of a sequence of bits, must be coherent, in the sense explained earlier, with the patterns of organisation of other bit representations of input data transformed in the same manner.
Fig. 5.2.2 : A possible representation of "chicken tortilla soup" as a sequence of bits; meaningless symbols used to construct representations inside a computer
* The entire structure of the new representation scheme, as it were, rests on the patterns of organisation of all possible sequences of bits representing different objects in the new scheme. Indeed, those patterns must be coherent with each other for there to be a representation scheme.
Need for Representation Learning
* The performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. For that reason, much of the actual effort in deploying machine learning algorithms goes into the design of pre-processing pipelines and data transformations that result in a representation of the data that can support effective machine learning. Such feature engineering is important but labour-intensive and highlights the weakness of current learning algorithms, that is, their inability to extract and organise the discriminative and useful information from the data. Feature engineering is a way to take advantage of human creativity and prior knowledge to compensate for that weakness.
* In order to expand the scope and ease of applicability of machine learning, it is highly desirable to make learning
algorithms less dependent on feature engineering, so that novel applications could be constructed faster. Wouldn't
it be nice if you could automatically discover from the dataset what features are important?
In representation learning, data is sent into the machine, and it learns the representation on its own. It is a way of
determining a data representation of the features, the distance function, and the similarity function that
determines how the predictive model will perform. Representation learning works by reducing high-dimensional
data to low-dimensional data, making it easier to discover patterns and anomalies while also providing a better
understanding of the data’s overall behaviour.
How Deep Neural Networks Learn Representations (How Representation Learning Works)
‘As you know, you train deep learning neural networks to predict something for a given new input. Any deep neural
Network learns to make predictions via an iterative training process, during which you repeatedly feed sample
input data and gradually adjust the behaviour of all layers of neurons in the neural network so that they jointly
leam to transform input data into good predictions. You get to decide what those predictions should be and also
specify precisely how to measure their goodness to induce learning.
* During training, each layer of neurons learns to transform its inputs into a different representation. This representation is itself an input to subsequent layers, which, in turn, learn to transform it into yet other representations. Eventually you reach the final layer, which learns to transform its input into predictions (however you had defined them). Each layer, in effect, learns to encode its inputs into a representation. You can use the following diagram for reference.
Fig. 5.2.3 : During training, a sample input flows through the layers to a predicted output, which is compared with the desired output; each layer learns a representation of its inputs.
* So, how can you make a deep neural network learn representations and output those instead of predictions? Simple : you discard the predictions but keep the representations that it learnt during the course of training, and it outputs representations instead of predictions when presented with input data. For example, let's say that you have a network that has already learned to make good predictions (however you had quantified goodness during training). You then slice off its last layer, as shown in the following figure, and that's it. You are left with a slightly shorter neural network, which, when fed input data, outputs an abstract representation of it instead of predictions.
Fig. 5.2.4 : A deep neural net (already trained); the original final layer that outputs predictions is discarded, and the new final layer outputs abstract representations.
* You can use the representations produced by this shorter neural network for a purpose different from the prediction objectives you specified to induce learning in the training phase. For example, you could feed the representations as inputs to a second neural network for a different learning task. The second neural network would benefit from the representational knowledge learned by the first one, assuming that the prediction objectives you specified to train the first neural network induced learning of representations useful for the second one.
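The "slice off the last layer" idea can be sketched in a few lines of PyTorch. The network below merely stands in for an already-trained predictor; its sizes are assumptions, and in practice you would load trained weights rather than use a freshly initialised model.

```python
import torch
import torch.nn as nn

# A small network assumed to be already trained to make predictions.
trained_net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),      # the layer whose outputs we keep as representations
    nn.Linear(64, 10),                  # final prediction layer, to be discarded
)

# Slice off the last layer: what remains outputs 64-dimensional representations.
representation_net = nn.Sequential(*list(trained_net.children())[:-1])

x = torch.rand(8, 784)
features = representation_net(x)        # representations usable by a second model
print(features.shape)                   # torch.Size([8, 64])
```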
* If the main or only purpose of training a neural network is to learn useful representations, then the discarded prediction objectives are more accurately described as training objectives. This is the essence of deep representation learning. The training objectives you specify to induce learning in the training phase, along with the architecture of a neural net (e.g. number, sizes, types, and arrangement of layers), determine what kinds of representations the neural net learns. Your job is to come up with training objectives and suitable architectures that together induce learning of the kinds of representations you want as a by-product. The goal is not to make predictions per se, but to learn useful representations of input data.
Example : Learning Representations of Objects in Images
* Let's take a simple example of representation learning to understand how it works. Assume that you have an autoencoder. As you know, autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without any supervision. You train the autoencoder as you would train a regular autoencoder, but then drop the decoding part so as to only get the codings, which are the representations. The following diagram illustrates how you can use autoencoders for representation learning.
Fig. 5.2.5 : During training, a sample input is compressed into the middle layer (codings) and a reconstruction is generated and compared with the desired output; after training, the most compressed representation in the middle layer is used as the learned representation.
* When you train an autoencoder with a large number of examples (say, a few million digital photos), it learns to compress input images into small representations in the middle layer that encode patterns of organisation of the portrayed objects necessary for regenerating close approximations of those images. After training, you discard all layers following the middle one (codings). You are then left with a neural network that transforms images with millions of pixels into a small number of values that represent the objects in those images.
Why and When is Deep Representation Learning Necessary?
* There are three major reasons to use deep representation learning, as follows.
1. If you don't have enough training data.
2. If you have zero examples for many categories of the objects of interest.
3. If your problem requires a model more computationally complex than feasible.
* In all cases, it's assumed that the problem at hand is complicated enough to require deep learning.
What Makes a Representation Good?
* The following priors (factors that are desired or assumed to be present) play a key role in extracting a good representation by learning a function f that maps input x to output y. A learning algorithm can exploit one or more of these priors to learn to output representations suited to a specific task.
1) Reuse of features → Features are reused across examples; this is achieved by parameters being shared.
2) Hierarchical organisation of explanatory factors → The concepts that are useful for describing the world can be defined in terms of other concepts, in a hierarchy, with more abstract concepts higher in the hierarchy, defined in terms of less abstract ones. This assumption is exploited with deep representations.
3) Semi-supervised learning → With inputs x and target y to predict, a subset of the factors explaining the distribution of x explain much of y, given x. Hence representations that are useful for p(x) tend to be useful when learning p(y | x), allowing sharing of statistical strength between the unsupervised and supervised learning tasks.
4) Shared factors across tasks → With many y's of interest or many learning tasks in general, tasks (e.g. the corresponding p(y | x, task)) are explained by factors that are shared with other tasks, allowing sharing of statistical strength across tasks.
5) Manifolds → Probability mass concentrates near regions that have a much smaller dimensionality than the original space where the data lives. This is explicitly exploited in some of the autoencoder algorithms and other manifold-inspired algorithms.
6) Natural clustering → Different values of categorical variables such as object classes (e.g. cats, dogs) tend to be associated with separate manifolds. Each manifold is composed of learned representations of an object class (say dog, cat). So moving along a manifold tends to preserve the value of a category (e.g. variations of dog when moving on the "dog" manifold). Interpolating across object classes would require going through a low-density region separating the manifolds. In essence, manifolds representing object classes tend not to overlap much. This factor is exploited in machine learning.
7) Temporal and spatial coherence → Identifying slowly moving or changing features in temporal/spatial data could be used as a means to learn useful representations. Even though different features change at different spatial and temporal scales, the values of the categorical variables of interest tend to change slowly. So this prior can be used as a mechanism to force the representations to change slowly, penalising change in the values of categorical variables over time or space.
8) Sparsity → For a given observation x, only a small set of possible features are relevant. This could be captured in the representation by features that are often zero or by the fact that the extracted features are insensitive to variations of x. Sparse autoencoders use this prior in the form of a regularisation of the representation.
9) Simplicity of factor dependencies → If a representation is abstract enough, the features may relate to each other through simple linear dependencies. This can be seen in many laws of physics, and this is the prior that is assumed when stacking a simple linear predictor on top of a learned representation that is rich and abstract enough.
5.3 Greedy Layer-wise Pre-training
arming played a Key historical role inthe revival of deep neural networks, enabling researchers for
© Unsupervised le
etwork without requiring architectural specialisations like convolution or
the first time to train a deep supervised n ;
recurrence. You can call this procedure unsupervised pre-training, or more precisely, greedy layer-wise
insuperized pre-training. This proceaure isan established example of how a representation learned for one task
rer i trying to capture the shape ofthe input astibution) can sometimes be useful for another
(unsuy
task (supervised learning with the same input domain)
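A rough PyTorch sketch of the greedy, layer-wise idea is shown below: each layer is pre-trained as a small autoencoder on the representations produced by the layers before it, and the pre-trained layers are then stacked for the supervised task. The layer sizes, optimiser and number of epochs are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def pretrain_layer(layer, data, epochs=5):
    """Greedily pre-train one layer as an autoencoder on the outputs of earlier layers."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        h = torch.relu(layer(data))
        loss = nn.functional.mse_loss(decoder(h), data)   # reconstruct this layer's input
        loss.backward()
        opt.step()
    return torch.relu(layer(data)).detach()               # representations for the next layer

x = torch.rand(256, 784)                                  # unlabeled data
layer1, layer2 = nn.Linear(784, 256), nn.Linear(256, 64)
h1 = pretrain_layer(layer1, x)                            # layer 1 is trained first
h2 = pretrain_layer(layer2, h1)                           # then layer 2 on layer 1's codes

# The pre-trained layers are then stacked and fine-tuned on the supervised task.
classifier = nn.Sequential(layer1, nn.ReLU(), layer2, nn.ReLU(), nn.Linear(64, 10))
```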