Recurrent Neural Networks

At the end of this unit, you should be able to understand and comprehend the following syllabus topics:
• Recurrent Neural Networks
  ○ Types of Recurrent Neural Networks
  ○ Feed-Forward Neural Networks vs Recurrent Neural Networks
• Long Short-Term Memory Networks (LSTM)
• Encoder Decoder Architectures
• Recursive Neural Networks

3.1 Recurrent Neural Networks (RNNs)

You have already learnt about Recurrent Neural Networks in "Neural Network (NN) Architecture" in Unit 1, Section 1.3. You are good!

3.1.1 Types of Recurrent Neural Networks

Earlier you had learnt about simple RNN and deep RNN. Let's learn about some more types of RNNs.

1. One-to-One RNN
• The simplest type of RNN is One-to-One, which allows a single input and a single output. It has fixed input and output sizes and acts as a traditional neural network. An example of a One-to-One RNN application is image classification. Fig. 3.1.2 illustrates a One-to-One RNN.

Fig. 3.1.2

2. One-to-Many RNN
• One-to-Many is a type of RNN that gives multiple outputs when given a single input. It takes a fixed input size and gives a sequence of data outputs. Its applications can be found in music generation and image captioning. Fig. 3.1.3 illustrates a One-to-Many RNN.

3. Many-to-One RNN
• Many-to-One RNN is used when a single output is required from multiple input units or a sequence of them. It takes a sequence of inputs to give a fixed output. Sentiment analysis is a common example of Many-to-One RNN. Fig. 3.1.4 illustrates a Many-to-One RNN.

4. Many-to-Many RNN with Equal Unit Size
• Many-to-Many RNN is used to generate a sequence of output data from a sequence of input units. In the case of equal unit size, the number of input units and output units is the same. A common application of Many-to-Many RNN with equal unit size is name-entity recognition.
• Fig. 3.1.5 illustrates a Many-to-Many RNN with equal unit size.

Fig. 3.1.5

5. Many-to-Many RNN with Unequal Unit Size
• Many-to-Many RNN is used to generate a sequence of output data from a sequence of input units. In the case of unequal unit size, inputs and outputs have different numbers of units. A common application of Many-to-Many RNN with unequal unit size is machine translation.
• Fig. 3.1.6 illustrates a Many-to-Many RNN with unequal unit size.

Fig. 3.1.6

3.1.2 Feed-Forward Neural Networks vs Recurrent Neural Networks

Table 3.1.1 provides a quick comparison between feed-forward neural networks and recurrent neural networks.

Table 3.1.1
Comparison Attribute               | Feed-forward Neural Networks                                     | Recurrent Neural Networks
Signal flow direction              | Forward only                                                     | Bidirectional
Delay introduced                   | No                                                               | Yes
Complexity                         | Low                                                              | High
Neuron independence in same layer  | Yes                                                              | No
Speed                              | High                                                             | Slow
Commonly used for                  | Pattern recognition, speech recognition, and character recognition | Language translation, speech-to-text conversion, and robotic control
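To make the many-to-one case above concrete, here is a minimal sketch of a sentiment-style classifier. The use of PyTorch, the layer sizes, and the vocabulary size are illustrative assumptions, not something prescribed by the syllabus: a sequence of word indices goes in, and a single class score comes out.

    import torch
    import torch.nn as nn

    class ManyToOneRNN(nn.Module):
        """Many-to-one RNN: a whole input sequence produces one output (e.g. a sentiment score)."""
        def __init__(self, vocab_size=10000, embed_dim=64, hidden_dim=128, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, token_ids):                  # token_ids: (batch, seq_len)
            x = self.embed(token_ids)                  # (batch, seq_len, embed_dim)
            _, h_last = self.rnn(x)                    # h_last: (1, batch, hidden_dim)
            return self.fc(h_last.squeeze(0))          # one prediction per sequence

    model = ManyToOneRNN()
    dummy_batch = torch.randint(0, 10000, (4, 20))     # 4 sequences of 20 word indices
    print(model(dummy_batch).shape)                    # torch.Size([4, 2])

A one-to-many or many-to-many variant differs only in how many inputs are fed and how many of the per-step outputs are kept.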
3.2 Long Short-Term Memory Networks (LSTM)

You previously got a brief introduction to LSTM. Let's dive a little deeper in this section. Fig. 3.2.1 shows a simple diagram of an LSTM cell.

Fig. 3.2.1

• LSTM networks are the most commonly used variation of Recurrent Neural Networks (RNNs). The critical components of the LSTM are the memory cell and the gates (including the forget gate but also the input gate). The contents of the memory cell are modulated by the input gates and forget gates. Assuming that both of these gates are closed, the contents of the memory cell will remain unmodified between one time-step and the next. The gating structure allows information to be retained across many time-steps, and consequently also allows gradients to flow across many time-steps. This allows the LSTM model to overcome the vanishing gradient problem that occurs with most Recurrent Neural Network models.
• The LSTM uses three internal neural networks as the respective gates, which control what enters, what stays in, and what leaves the state memory. The LSTM doesn't require repeated copies of itself, like the basic recurrent cell, so it avoids the problems of vanishing and exploding gradients. You can place this LSTM cell in a layer and train the neural networks inside it using normal backpropagation and optimisation.

3.2.1 LSTM Gates

As you understand, there are three gates used in an LSTM.

1. Forget Gate
• At the forget gate, the input is combined with the previous output to generate a fraction between 0 and 1 that determines how much of the previous state needs to be preserved (or, in other words, how much of the state should be forgotten). This output is then multiplied with the previous state.
• Note: An activation output of 1.0 means "remember everything" and an activation output of 0.0 means "forget everything." From a different perspective, a better name for the forget gate might be the "remember gate"!

2. Input Gate
• The input gate operates on the same signals as the forget gate, but here the objective is to decide which new information is going to enter the state of the LSTM. The output of the input gate (again a fraction between 0 and 1) is multiplied with the output of the tanh block that produces the new values that must be added to the previous state. This gated vector is then added to the previous state to generate the current state.

3. Output Gate
• At the output gate, the input and previous state are gated as before to generate another scaling fraction that is combined with the output of the tanh block that brings in the current state. This output is then given out. The output and state are fed back into the LSTM block.
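The three gates can be summarised as one step of computation. The following numpy sketch is illustrative only (the weight shapes, initialisation, and sizes are assumptions): each gate is a sigmoid of the previous output combined with the current input, and the state is updated exactly as described above for the forget and input gates.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM time-step. Each W[k] maps [h_prev, x_t] to one internal signal."""
        z = np.concatenate([h_prev, x_t])
        f = sigmoid(W["f"] @ z + b["f"])          # forget gate: how much old state to keep
        i = sigmoid(W["i"] @ z + b["i"])          # input gate: how much new information to add
        g = np.tanh(W["g"] @ z + b["g"])          # candidate values to be added to the state
        o = sigmoid(W["o"] @ z + b["o"])          # output gate: how much of the state to expose
        c_t = f * c_prev + i * g                  # new cell state
        h_t = o * np.tanh(c_t)                    # new output / hidden state
        return h_t, c_t

    # Illustrative sizes: 3-dimensional input, 4-dimensional state.
    n_in, n_hid = 3, 4
    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "figo"}
    b = {k: np.zeros(n_hid) for k in "figo"}
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)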
3.2.2 LSTM Units (Components of LSTM)

• The units in the layers of Recurrent Neural Networks are a variation on the classic artificial neuron.
• Each LSTM unit has two types of connections:
1. Connections from the previous time-step (outputs of those units)
2. Connections from the previous layer
• The memory cell in an LSTM network is the central concept that allows the network to maintain state over time. The main body of the LSTM unit is referred to as the LSTM block, as shown in Fig. 3.2.3.

Fig. 3.2.3 (legend: connection with time-lag; branching point; multiplication; sum over all inputs; gate activation function (always sigmoid); input activation function (usually tanh); output activation function (usually tanh))

• The components in an LSTM unit are as follows:
  ○ Three gates: input gate (input modulation gate), forget gate, output gate
  ○ Block input
  ○ Memory cell
  ○ Output activation function
  ○ Peephole connections
• There are three gate units, which learn to protect the linear unit from misleading signals:
1. The input gate protects the unit from irrelevant input events.
2. The forget gate helps the unit forget previous memory contents.
3. The output gate exposes the contents of the memory cell (or not) at the output of the LSTM unit.
• The output of the LSTM block is recurrently connected back to the block input and to all of the gates of the LSTM block.
• The input, forget, and output gates in an LSTM unit have sigmoid activation functions for [0, 1] restriction. The LSTM block input and output activation function is usually a tanh activation function.

3.2.3 Advantages of LSTM

• The specific gated architecture of LSTM is designed to improve on the following shortcomings of the classical RNN:
1. It avoids the exploding and vanishing gradients, specifically with the use of the forget gate at the beginning.
2. Long-term memories can be preserved along with learning new trends in the data. This is achieved through the combination of gating and maintaining the state as a separate signal.
3. Prior information on states isn't required, and the model is capable of learning from default values.
4. Unlike other deep learning architectures, there are not many hyperparameters that need to be tuned for model optimisation.

3.3 Encoder Decoder Architectures

• Encoder-Decoder models are a family of models which learn to map datapoints from an input domain to an output domain via a two-stage network:
1. The encoder, represented by an encoding function z = f(x), compresses the input into a latent-space representation.
2. The decoder, y = g(z), aims to predict the output from the latent-space representation.
• The latent representation here essentially refers to a feature (vector) representation which is able to capture the underlying semantic information of the input that is useful for predicting the output.
• These models are extremely popular in image-to-image translation problems, as well as for sequence-to-sequence models in natural language processing (NLP), where you can translate, say, English to French. Fig. 3.3.1 illustrates the block diagram of a simple encoder-decoder model.

Fig. 3.3.1: Input image → Latent representation → Output map

• These models are usually trained by minimising the reconstruction loss, L(y, ŷ), which measures the differences between the ground-truth output y and the subsequent reconstruction ŷ. The output here could be an enhanced version of the image (such as in image de-blurring or super-resolution), or a segmentation map. Autoencoders are a special case of encoder-decoder models in which the input and output are the same. You will learn about autoencoders later in the book.

3.3.1 Sequence to Sequence Model (seq2seq)

• A sequence to sequence (seq2seq) model aims to map a fixed-length input to a fixed-length output, where the lengths of the input and output may differ. For example, translating "What are you doing today" from English to Chinese has an input of 5 words and an output of 7 symbols.
• Clearly, you cannot use a regular LSTM network to map each word from the English sentence to the Chinese sentence. This is why the sequence to sequence model is used to address problems like this one.
• A sequence to sequence model lies behind numerous systems which you use on a daily basis. For instance, the seq2seq model powers applications like Google Translate, voice-enabled devices, and online chatbots. Generally speaking, seq2seq has the following applications:
1. Machine translation: Machine translation is the task of automatically converting source text in one language to text in another language. Given a sequence of text in a source language, there is no one single best translation of that text to another language. This is because of the natural ambiguity and flexibility of human language. This makes the challenge of automatic machine translation difficult, perhaps one of the most difficult in machine learning.
2. Speech recognition: You must have used speech recognition in products such as Amazon Alexa, Apple Siri, and Google Home. Speech recognition helps you to carry out voice-assisted tasks, and the machine typically "understands" what you need or are asking for.
3. Video captioning: You can automatically generate video captions based on what is happening in the video or what is being said, in various languages. For example, the video could be in English, but you can caption it in Hindi.
4. Question-answer problems: These could be chatbot applications or voice assistants that respond to your questions with probable answers. You would have experienced it in the customer care section of any e-commerce application.

3.3.2 How the Sequence to Sequence Model Works

• Sequence to sequence models use the encoder-decoder architecture. For simplicity, consider the following encoder-decoder architecture. The model consists of 3 parts, as follows:
1. Encoder: The encoder is a stack of several recurrent units (LSTM or GRU cells for better performance), where each accepts a single element of the input sequence, collects information for that element, and propagates it forward. In the question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as x_i, where i is the order of that word. The hidden states h_t are computed using the following formula:
h_t = f(W^(hh) h_(t-1) + W^(hx) x_t)
2. Intermediate (encoder) vector: This is the final hidden state produced from the encoder part of the model, computed using the formula above. This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions. It acts as the initial hidden state of the decoder part of the model.
3. Decoder: Like the encoder, the decoder is also a stack of several recurrent units, where each predicts an output y_t at a time step t. Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state. In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as y_i, where i is the order of that word. Any hidden state h_t is computed using the following formula:
h_t = f(W^(hh) h_(t-1))
As you can see, you are just using the previous hidden state to compute the next one. The output y_t at time step t is computed using the following formula:
y_t = softmax(W^S h_t)
You calculate the outputs using the hidden state at the current time step together with the respective weight W^S. Softmax is used to create a probability vector which will help you determine the final output (e.g., the word in the question-answering problem).
• The power of this model lies in the fact that it can map sequences of different lengths to each other. As you can see, the inputs and outputs are not correlated, and their lengths can differ. This opens up a whole new range of problems which can now be solved using such an architecture.
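A minimal sketch of this encoder-decoder arrangement is given below; the use of PyTorch, GRU cells, and the vocabulary sizes and dimensions are illustrative assumptions. The encoder consumes the source sequence and hands its final hidden state (the intermediate vector) to the decoder, which then produces scores for the output sequence one step at a time.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, src_vocab=8000, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(src_vocab, embed_dim)
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

        def forward(self, src):                     # src: (batch, src_len)
            _, h = self.rnn(self.embed(src))        # h: the intermediate (context) vector
            return h

    class Decoder(nn.Module):
        def __init__(self, tgt_vocab=8000, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(tgt_vocab, embed_dim)
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, tgt_vocab)

        def forward(self, tgt_in, h):               # tgt_in: previously generated target tokens
            y, h = self.rnn(self.embed(tgt_in), h)  # decoder starts from the encoder's state
            return self.out(y), h                   # scores over the target vocabulary per step

    encoder, decoder = Encoder(), Decoder()
    src = torch.randint(0, 8000, (2, 5))            # e.g. a 5-word source sentence (batch of 2)
    tgt_in = torch.randint(0, 8000, (2, 7))         # e.g. a 7-symbol target sequence
    logits, _ = decoder(tgt_in, encoder(src))
    print(logits.shape)                             # torch.Size([2, 7, 8000])

Note that the source length (5) and target length (7) differ, which is exactly the situation the seq2seq model is designed for.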
3.4 Recursive Neural Networks

Note: Do not get confused between Recursive Neural Networks and Recurrent Neural Networks (RNN). They have the same abbreviation but are different. To avoid confusion, I will use "Recursive Neural Networks" in full for all my references.

• Recursive Neural Networks, like Recurrent Neural Networks, can deal with variable-length input. The primary difference is that Recursive Neural Networks have the ability to model the hierarchical structures in the training dataset. They are constructed in such a way that the same set of weights is applied recursively over different tree-like structures. Recursive neural networks generalise recurrent neural networks from a chain-like structure to a tree-like structure.
• For example, images commonly have a scene composed of many objects. Deconstructing scenes is often a problem domain of interest that is nontrivial. The recursive nature of this deconstruction challenges us to not only identify the objects in the scene, but also how the objects relate to form the scene.
• Fig. 3.4.1 illustrates the difference between Recursive Neural Networks and Recurrent Neural Networks.

Fig. 3.4.1: (a) Recurrent Neural Network; (b) Recursive Neural Network (over the phrase "that movie was cool")

3.4.1 Network Architecture

• A Recursive Neural Network architecture is composed of a shared-weight matrix and a binary tree structure that allows the recursive network to learn varying sequences of words or parts of an image. It is useful as a sentence and scene parser. Recursive Neural Networks use a variation of backpropagation called backpropagation through structure (BPTS). The feed-forward pass happens bottom-up, and backpropagation is top-down. Think of the objective as the top of the tree, whereas the inputs are the bottom.
• Fig. 3.4.2 illustrates the network architecture of Recursive Neural Networks.

Fig. 3.4.2: Input: x1, x2, x3, ...

• Let's take a simple architecture, such as the following, to understand better.

Fig. 3.4.3

• In the simplest architecture, nodes are combined into parents using a weight matrix that is shared across the whole network, and using a non-linear activation function such as tanh. If c1 and c2 are n-dimensional vector representations of nodes, their parent will also be an n-dimensional vector, calculated as follows:
p_(1,2) = tanh(W [c1; c2])
where W is a learned n × 2n weight matrix.
• This architecture, with a few improvements, has been used for successfully parsing natural scenes, syntactic parsing of natural language sentences, and recursive autoencoding and generative modelling of 3D shape structures in the form of cuboid abstractions.

3.4.2 Types of Recursive Neural Networks

Recursive Neural Networks have a few variations:
1. Recursive Autoencoder: Recursive autoencoders learn how to reconstruct the input. In the case of Natural Language Processing (NLP), they learn how to reconstruct contexts. A semi-supervised recursive autoencoder learns the likelihood of certain labels in each context.
2. Recursive Neural Tensor Network: It computes a supervised objective at each node of the tree. The "tensor" part means that it calculates the gradient a little differently, factoring in more information at each node by taking advantage of another dimension of information using a tensor (a matrix of three or more dimensions).
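The parent-composition rule p_(1,2) = tanh(W [c1; c2]) can be applied recursively up a parse tree. The following numpy sketch is illustrative (the leaf vectors, the particular binary parse, and the dimensions are assumptions): the same weight matrix W combines children into parents at every node.

    import numpy as np

    n = 4                                           # dimension of every node representation
    rng = np.random.default_rng(0)
    W = rng.normal(size=(n, 2 * n))                 # shared n x 2n weight matrix

    def compose(c1, c2):
        """Parent representation p = tanh(W [c1; c2]); the same W is used at every node."""
        return np.tanh(W @ np.concatenate([c1, c2]))

    # Leaf vectors for the phrase "that movie was cool" (random stand-ins for word vectors).
    that, movie, was, cool = (rng.normal(size=n) for _ in range(4))

    # Combine leaves bottom-up following one possible binary parse: ((that movie) (was cool)).
    left = compose(that, movie)
    right = compose(was, cool)
    root = compose(left, right)                     # representation of the whole phrase
    print(root.shape)                               # (4,)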
3.4.3 Applications of Recursive Neural Networks

• While Recursive and Recurrent Neural Networks share many of the same use cases, Recursive Neural Networks are traditionally used in NLP because of their ties to binary trees, contexts, and natural-language-based parsers. For example, constituency parsers are able to break up a sentence into a binary tree, segmenting it by the linguistic properties of the sentence. In the case of Recursive Neural Networks, it is a constraint that you use a parser that builds the tree structure (typically constituency parsing).
• Recursive Neural Networks can capture both granular structure and higher-level hierarchical structure in datasets such as images or sentences. Applications of recursive neural networks include the following:
1. Image scene decomposition
2. Natural Language Processing (NLP)
• Two specific network configurations you see in practice are recursive autoencoders and recursive neural tensors. You use recursive autoencoders to break up sentences into segments for NLP. You use recursive neural tensors to break up an image into its composing objects and to semantically label the objects in the scene.
• Recurrent Neural Networks tend to be faster to train, thus you typically use them in more temporal applications, but they have been shown to work well in NLP-based domains such as sentiment analysis as well.

Here are a few review questions to help you gauge your understanding of this chapter. Try to attempt these questions and ensure that you can recall the points mentioned in the chapter.

Recurrent Neural Networks
Q.1 Compare Feed-Forward Neural Networks and Recurrent Neural Networks. (4 Marks)
Q.2 Describe the types of RNNs. (6 Marks)

Long Short-Term Memory Networks (LSTM)
Q.3 With a diagram, explain the general layout of an LSTM cell. (6 Marks)
Q.4 Describe LSTM gates. (4 Marks)
Q.5 Explain the components of an LSTM unit. (6 Marks)
Q.6 Describe the advantages of LSTM. (4 Marks)

Encoder Decoder Architectures
Q.7 Explain the encoder decoder architecture. (4 Marks)
Q.8 Write a short note on the Sequence to Sequence Model. (4 Marks)
Q.9 Where would you use seq2seq and why? (4 Marks)
Q.10 Explain how the seq2seq model works. (6 Marks)

Recursive Neural Networks
Q.11 Explain Recursive Neural Networks. (4 Marks)
Q.12 Explain the network architecture of Recursive Neural Networks. (6 Marks)
Q.13 With a diagram, describe the difference between a Recurrent Neural Network and a Recursive Neural Network. (4 Marks)
Q.14 Describe the types of Recursive Neural Networks. (4 Marks)
Q.15 Explain the applications of Recursive Neural Networks.

Autoencoders

At the end of this unit, you should be able to understand and comprehend the following syllabus topics:
• Types of Autoencoders
  ○ Undercomplete Autoencoders
  ○ Regularised Autoencoders
  ○ Sparse Autoencoders
  ○ Denoising Autoencoders
  ○ Stochastic Encoders and Decoders
  ○ Contractive Autoencoders
• Applications of Autoencoders

4.1 Autoencoder

Definition: Autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without any supervision.

• These codings typically have a much lower dimensionality (depth of information) than the input data, making autoencoders useful for dimensionality reduction and compression. These codings, or simply the code, are a compact "summary" or "compression" of the input, also called the latent-space representation. For example, the MP3 format encodes an audio file to a much smaller size than the raw .wav file. Similarly, the JPG image format, when compressing images, preserves the look and feel of the images without compromising a lot on the quality.
• More importantly, autoencoders act as powerful feature detectors, and they can be used for unsupervised pre-training of deep neural networks. Lastly, they are capable of randomly generating new data that looks very similar to the training data; this is called a generative model. For example, you could train an autoencoder on pictures of faces, and it would then be able to generate new faces.

4.1.1 Basic Concept of Efficient Data Representation

• Suppose that I ask you which of the following number sequences you find the easiest to memorise:
  ○ 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
  ○ 50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20
• At first glance, it would seem that the first sequence should be easier, since it is much shorter. However, if you look carefully at the second sequence, you may notice that it follows two simple rules: even numbers are followed by their half, and odd numbers are followed by their triple plus one (this is a famous sequence known as the hailstone sequence). Once you notice this pattern, the second sequence becomes much easier to memorise than the first, because you only need to memorise the two rules, the first number, and the length of the sequence. Isn't that better than memorising the entire series of numbers?
• Now you could argue that you are an expert at memorising very large sequences of numbers and hence you don't care much about the existence of a pattern in the second sequence. You would just learn every number by heart and that would be it. But, generally speaking, it is a fact that it is hard to memorise long sequences and it is useful to recognise patterns. That's the basic concept behind how an autoencoder works: it recognises the "hidden" patterns and then later uses them to reconstruct the input.

4.2 Components and Architecture of Autoencoders

• An autoencoder consists of 3 components:
1. Encoder: It compresses the input into a latent-space representation (the code). The encoder layer encodes the input image as a compressed representation in a reduced dimension. The compressed image is a distorted version of the original image.
2. Code: This is the compressed input (from the encoder), which is fed to the decoder for reconstructing the original input later.
3. Decoder: It decodes the encoded output, in the form of the code, back to the original input. The decoded output is a lossy reconstruction of the original input, and it is reconstructed from the latent-space representation (the code). The goal is to get an output as identical to the input as possible.
• Simply speaking, first the encoder compresses the input and produces the code; the decoder then reconstructs the input using only this code. Fig. 4.2.1 depicts these components or layers of the neural network in an autoencoder.

Fig. 4.2.1

• The hidden layer holding the code is also known as the bottleneck. This is a well-designed way of deciding which aspects of the observed data are relevant information and which aspects can be discarded.

Fig. 4.2.2: Input layer, hidden layer, output layer

4.3 Training an Autoencoder

• Autoencoders are trained the same way as ANNs are trained. You need to set the following four parameters before training an autoencoder:
1. Code size: It is the number of nodes in the middle (bottleneck) layer. A smaller size results in more compression, but it may be difficult to make the size smaller beyond a certain limit and still get satisfactory results.
2. Number of layers: As you understand, the autoencoder can be as deep as you like. Very similar to training an ANN, you need to decide how many layers the autoencoder should have.
3. Number of nodes per layer: You also need to decide the number of nodes per layer of the autoencoder. Typically, the number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder. Also, the decoder is usually symmetric to the encoder in terms of layer structure.
4. Loss (cost) function: You either use mean squared error (MSE) or cross-entropy as the loss function. If the input values are in the range [0, 1] then you typically use cross-entropy, otherwise you use the mean squared error.
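Putting the four parameters together, here is a minimal training sketch; the use of PyTorch, a code size of 32, the layer sizes, and the choice of MSE for, say, flattened 784-pixel images are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Choices 1-3: code size 32, two encoder layers mirrored by two decoder layers.
    encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
    decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
    autoencoder = nn.Sequential(encoder, decoder)

    # Choice 4: mean squared error between the input and its reconstruction.
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

    x = torch.rand(64, 784)                       # a dummy batch standing in for real images
    for _ in range(5):                            # a few illustrative training steps
        reconstruction = autoencoder(x)
        loss = loss_fn(reconstruction, x)         # the target is the input itself
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()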
4.4 Features / Usage / Applications of Autoencoders

• Autoencoders typically have the following features or usage:
1. Image colouring: Autoencoders can be used for converting black and white images into coloured images. Depending on what the image is and what the typical colours of the objects in that image are, it is possible to colour the image.
2. Feature extraction: Autoencoders extract only the required features of an image and generate the output by removing any noise or unnecessary interruption. They can also be used for compression.
3. Dimensionality reduction: The reconstructed image is similar to the input image but with reduced dimensions (features). It helps in providing a similar image with a reduced number of pixels.
4. Denoising image: A denoising autoencoder can be used to reconstruct the image by eliminating the noise from the input image.

4.5 Types of Autoencoders

• At a high level, autoencoders are of the following types:
1. Undercomplete
2. Regularised
3. Convolutional
4. Sparse
5. Stacked
6. Denoising
7. Variational
8. Contractive

Fig. 4.5.1

4.5.1 Undercomplete Autoencoder

Definition: An autoencoder whose code dimension is less than the input dimension is called undercomplete.

• The simplest architecture for constructing an autoencoder is to constrain the number of nodes present in the hidden layer(s) of the network, limiting the amount of information that can flow through the network. By penalising the network according to the reconstruction error, the model can then learn the most important attributes of the input data and how to best reconstruct the original input from an "encoded" state. Ideally, this encoding learns and describes latent attributes of the input data.
• Copying the input to the output may sound useless, and you are typically not interested in the output of the decoder. Instead, you are hoping that by training the autoencoder to perform the input-copying task, the hidden layer h will take on useful properties (obtain useful features). One way to obtain useful features from the autoencoder is to constrain h to have a smaller dimension than the input x. Fig. 4.5.2 illustrates an undercomplete autoencoder.

Fig. 4.5.2: Input layer, hidden layer, output layer

4.5.2 Regularised Autoencoder

• Ideally, you could train any architecture of autoencoder successfully, choosing the code dimension and the capacity of the encoder and decoder based on the complexity of the distribution to be modelled. Regularised autoencoders provide the ability to do so. Rather than limiting the model capacity by keeping the encoder and decoder shallow and the code size small, regularised autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularised autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function. You will learn about a few regularised autoencoders, such as sparse autoencoders and denoising autoencoders, subsequently.

4.5.3 Convolutional Autoencoders (CAE)

• As you know, a signal can be formed as a sum of other signals. Convolutional autoencoders use the convolution operator to encode the input into a set of simple signals and then try to reconstruct the input from those signals using the convolution network.
• A convolutional autoencoder is typically used for:
1. Image reconstruction
2. Image colourization
3. Latent space clustering
4. Generating higher-resolution images

Fig. 4.5.3

4.5.4 Sparse Autoencoders (SAE)

• One of the constraints that often leads to good feature extraction is sparsity. Using sparsity, you can push the autoencoder to reduce the number of active neurons in the coding layer. For example, it may be pushed to have on average only 5% significantly active neurons in the coding layer. This forces the autoencoder to represent each input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends up representing a useful feature (if you could speak only a few words per month, you would probably try to make them worth listening to).

Fig. 4.5.4
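One simple way to encourage such sparsity is to add a penalty on the code activations to the reconstruction loss. The sketch below uses an L1 activity penalty with an illustrative weight; other formulations (for example, a KL-divergence penalty towards a 5% target activation) follow the same pattern.

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32), nn.Sigmoid())
    decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
    optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    sparsity_weight = 1e-3                        # illustrative value

    x = torch.rand(64, 784)
    code = encoder(x)
    reconstruction = decoder(code)
    # Reconstruction error plus a penalty that pushes code activations towards zero.
    loss = nn.functional.mse_loss(reconstruction, x) + sparsity_weight * code.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()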
These other properties include sparsity the derivative of the representation, and robustness to noise or to missing to have ‘of the representation, smallness of inputs, A regularised autoencoder can be nonlinesr and overcomplete but stil learn something useful about the data distribution, even if the coders such as sparse autoencoders and denoising autoencoders subsequently. model capacity is great enough to learn a trivial identity function. You wil learn about a few regularised autoen: 4.5.3 Convolution Autoencoders (CAE) a signal can be formed as a sum of other signals Convolutional Autoencoders use the convolution * Asyou know, t into a set of simple signals and then try to reconstruct the input from those signals operator to encode the input using the convolution network Wi ee sant 4xa2 Itis typically used for 1. Image reconstruction 3. Latent space clustering Fig. 4.5.3 2, Image coloutization 4, Generating higher resolution images 4.5.4 Sparse Autoencoders (SAE) saxt4xa2 One of the constraints that often leads to good feature extraction is sparsity. Using sparsity, you can push the autoencoder to reduce the number of active neurons in the coding layer. For example, it may be pushed to have on average only 5% significantly active neurons in the coding layer. This forces the autoencoder to represent each input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends up representing a useful feature (if you could speak only a few words per month, you would probably try to make them worth listening to). Input layers Si) Hon yes Ky Wi AWA xf K\ ; i Ky S =< =< SS Ss ZE St W Deep Learning (sPPU) Autoencoders re layers hel careful not to make the aute PS the autoencoder learn more complex codings. However, you must be Y you must ; encoder too powe input to a single arbitrary number Powerful. Imagine an encoder so powerful that it ust learns to map each will reconstruct the training dat - ‘he cecoder learns the reverse mapping). Obviously such aioie ; f ® Perfectly, but it will not have actually learned any useful data site _ represe tion in the '0 generalise well to new instances). It much similar to re Instances). It is very ; . simil f to you memorising the enti process (and it is unlikely ¢ The architecture of a stac i cody pea ere ae enc is typically symmetrical with regards to the central hidden layer (the Galiiceceneee looks like a sandwich. For example, an autoencoder may have 784 inputs, followed Saran ons, then a central hidden layer of 150 neurons, then another hidden layer with 300 put layer with 784 neurons. Such a stacked autoencoder is illustrated as following. 150 units 784 units Reconstructions (inputs) 784 units 4.5.6 Denoising Autoencoders (DAE) «You can force the autoencoder to learn useful features by adding nose tits inputs an then taining tt recover revents the autoencoder from trivially copying its inputs to its outputs and so it the original noise-free inputs. This PY tends up having to find patterns in the data, So, you are forcing the autoencoder to subtract the noise and produce the underlying meaningful deta, Such an autoencoder is called a denoising autoencoder. © The Fig. 4.56 illustrates a denoising autoencoder. Encoder Cecoder Noise tay tbl Code Output Original Image Deep Learning (SPPU) 48 ‘Autoencodery 4.5.7 Variational Autoencoder (VAE) ‘© Variational autoencoders are more modern and complex. They are quite different from all the autoencoders that you have learnt so far in the following respect. 1. 
4.5.7 Variational Autoencoder (VAE)

• Variational autoencoders are more modern and complex. They are quite different from all the autoencoders that you have learnt about so far, in the following respects:
1. They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after training (as opposed to denoising autoencoders, which use randomness only during training).
2. Most importantly, they are generative autoencoders, meaning that they can generate new instances that look like they were sampled from the training set.
• Let's understand how it works with the help of Fig. 4.5.7.

Fig. 4.5.7

• You see the typical structure of an autoencoder, but this time with a twist. Instead of directly producing a coding for a given input, the encoder produces a mean coding μ and a standard deviation σ. The actual coding is then sampled randomly from a Gaussian distribution with mean μ and standard deviation σ. After that, the decoder just decodes the sampled coding normally. The right part of the diagram shows a training instance going through this autoencoder. First, the encoder produces μ and σ, then a coding is sampled randomly (notice that it is not exactly located at μ), and finally this coding is decoded, and the final output resembles the training instance.

4.5.8 Stochastic Autoencoder

• In a stochastic autoencoder, both the encoder and the decoder are not simple functions but instead involve some noise injection. The output can be seen as sampled from a distribution: p_encoder(h | x) for the encoder and p_decoder(x | h) for the decoder, where h is the code (hidden layer) and x is the input (as well as the target for decoding). The output variables are treated as being conditionally independent given h, so that this probability distribution is inexpensive to evaluate, but some techniques, such as mixture density outputs, allow flexible modelling of outputs with correlations.

Fig. 4.5.8: p_encoder(h | x) and p_decoder(x | h)

• So, mathematically, you can say that:
Stochastic encoder: p_encoder(h | x) = p_model(h | x)
Stochastic decoder: p_decoder(x | h) = p_model(x | h)

4.5.9 Contractive Autoencoder

• The objective of a contractive autoencoder is to have a robust learned representation which is less sensitive to small variations in the data. Robustness of the representation is obtained by applying a penalty term to the loss function. The contractive autoencoder is another regularisation technique, just like sparse and denoising autoencoders, and it is a better choice than the denoising autoencoder for learning useful feature extraction. The model learns an encoding in which similar inputs have similar encodings. Hence, you are forcing the model to learn how to contract a neighbourhood of inputs into a smaller neighbourhood of outputs.
• You can explicitly train your model by requiring that the derivatives of the hidden layer activations are small with respect to the input. In other words, for small changes to the input, you should still obtain a very similar encoded state. This is quite similar to a denoising autoencoder in the sense that these small changes to the input are essentially considered noise and you would like your model to be robust against that noise. Simply speaking, denoising autoencoders make the reconstruction function (decoder) resist small but finite-sized changes in the input, while contractive autoencoders make the feature extraction function (encoder) resist small changes in the input.
• Fig. 4.5.9 illustrates a contractive autoencoder.

Fig. 4.5.9: Training observations; learned reconstruction function; linear identity function (perfect reconstruction). Similar inputs are contracted to a constant output within a neighbourhood, based on what the model observed during training.
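The requirement that the derivatives of the hidden layer be small with respect to the input is usually added as a penalty term: the squared Frobenius norm of the Jacobian of the encoder. The following sketch shows one way such a term could be computed with PyTorch autograd; the encoder, the code size, and the weight lam are illustrative assumptions.

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(784, 32), nn.Sigmoid())
    decoder = nn.Linear(32, 784)

    def contractive_loss(x, lam=1e-4):
        x = x.clone().requires_grad_(True)
        code = encoder(x)
        recon_loss = nn.functional.mse_loss(decoder(code), x.detach())
        # Squared Frobenius norm of d(code)/d(x), accumulated one code unit at a time.
        penalty = 0.0
        for j in range(code.shape[1]):
            grads = torch.autograd.grad(code[:, j].sum(), x,
                                        create_graph=True, retain_graph=True)[0]
            penalty = penalty + (grads ** 2).sum()
        return recon_loss + lam * penalty / x.shape[0]

    loss = contractive_loss(torch.rand(16, 784))
    loss.backward()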
Representation Learning

At the end of this unit, you should be able to understand and comprehend the following syllabus topics:
• Representation Learning
• Greedy Layer-wise Pre-training
• Transfer Learning and Domain Adaptation
• Distributed Representation
• Variants of CNN: DenseNet

5.1 Representation Learning (Feature Learning)

Definition: Feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data.

• It replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task. Before you learn more about representation learning, let's quickly learn (recap) about feature engineering so that you understand why representation learning is extremely useful.

5.1.1 Feature and Feature Engineering

Before diving into feature engineering, let's take a moment and do a recap of what you have learnt so far. Let's take a look at the overall machine learning pipeline.

Data

• The raw data, or just data, is a collection of observations of real-world phenomena. For instance, stock market data might involve observations of daily stock prices, announcements of earnings by individual companies, and even opinion articles from pundits. Sports data could have information on matches, the environment in which those matches were played, players' performances, and several other observations. Similarly, personal biometric data can include measurements of your minute-by-minute heart rate, blood sugar level, blood pressure, oxygen level, etc. You can come up with endless examples of data across different domains.
• Each piece of data provides a small window into a limited aspect of reality. The collection of all of these observations gives you a picture of the whole. But the picture is messy because it is composed of a thousand little pieces, and there's always measurement noise and missing pieces. Redundant data contains multiple aspects that convey exactly the same information; for instance, the day of the week may be present as a categorical variable with values of "Monday", "Tuesday", ..., "Sunday", and again included as an integer value between 0 and 6. If this day-of-week information is not present for some data points, then you have missing data on your hands.
• Why do you collect data? I am sure you would say that there are several questions that data can help you answer (or predict). Some of the popular questions are:
  ○ How likely is it that a customer buying product A will also buy product B?
  ○ Which team is likely to win?
  ○ How will the weather be next month?
  ○ What food should you eat to get healthier?
  ○ What is the risk of getting diabetes based on your biometric data?
• The path from data to answers is full of false starts and dead ends.

Fig. 5.1.1

• What starts out as a promising approach may not work in reality. What was originally just a hunch may end up leading to the best solution. Workflows with data are frequently multistage and iterative processes. For instance, stock prices are observed at the exchange, aggregated by an intermediary like Thomson Reuters, stored in a database, bought by a company, converted into a Hive store on a Hadoop cluster, pulled out of the store by a script, subsampled, massaged, and cleaned by another script, dumped to a file, and converted to a format that you can try out in your favourite modelling library in R, Python, or Scala. The predictions are then dumped back out to a CSV file and parsed by an evaluator, and the model is iterated multiple times, rewritten in C++ or Java by your production team, and run on all of the data before the final predictions are pumped out to another database.
• However, if you disregard the mess of tools and systems for a moment, you might see that the process involves two mathematical entities that are at the centre of machine learning: models and features.
Models

• A mathematical model of the data describes the relationships between its different aspects. For instance, a model that predicts stock prices might be a formula that maps a company's earning history, past stock prices, and other factors to the predicted stock price.
• But, for most machine learning tasks, features are required to be numeric so that they can be used in various computations. So, let's redefine features as:

Definition: A feature is a numeric representation of raw data.

• As you know, the features in a data set are also called its dimensions. So a data set having n features is called an n-dimensional data set. For example, consider the following data set.

Gender | Marks
Girl   | 65
Girl   | 46
Boy    | 56
Girl   | 42
Boy    | 84
Girl   | 42
Girl   | 40

• How many features or dimensions does it have? Two, right? Yes, this is a two-dimensional data set, or you could also say that this data set has 2 features. I know you are saying out loud, "hey look, the gender field is not numeric". As you understand, that is a raw data set. This is where feature engineering would come in: you would work out how to convert this field to something more meaningful and computationally more appropriate. For example, you could assign a value of "0" for boys and a value of "1" for girls. Now that is numeric, isn't it?
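As a quick illustration of that conversion, the following sketch (assuming the pandas library, and using only the legible rows of the toy table above) maps the Gender column to the numeric values just described.

    import pandas as pd

    # The toy data set from above: one categorical column, one numeric column.
    df = pd.DataFrame({
        "Gender": ["Girl", "Girl", "Boy", "Girl", "Boy", "Girl", "Girl"],
        "Marks":  [65, 46, 56, 42, 84, 42, 40],
    })

    # Turn the non-numeric Gender column into a numeric feature:
    # assign 0 for boys and 1 for girls, as described in the text.
    df["Gender"] = df["Gender"].map({"Boy": 0, "Girl": 1})
    print(df)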
: placing © Feature extraction is the eee ei res or the dimensionality of the ‘eine svar res, © Feature sel levant re selection is the filtering of ima observing variance o correlation ‘: 5 thresh « Atahigh-level, the feature. wees lolds to 9 eating new fea "ures from existing features, typically withthe goa of reducing redundant features fre }om your dataset. This is usually d determine ‘which features to remove. ma Process looks like shown in Fig, 5.14, Fig. 5.14 Data Engineering -vs- Feature Engineering Often raw data engineering (data pre-processing) is confused with feature engineering. Data engineering is the prepared. data. Feature engineering then:tunes the prepared data to create the ‘odel. These terms have specific meanings as outlined here. process of converting raw data into features expected by the machine learning m« © Raw data (or just data) = This refers to the data in its source form, without any prior preparation for machine leaming, Note that in this contest the data might be in its raw form (in a data lake) of in a transformed form (ina ae warehouse) Transformed data in 2 data warehouse might have been converted from its original raw re but in this contest, it means that the data was not prepared specifically for your analytics eer ‘a vo ah is dition, data sent from streaming systems that eventually call machine learning models machine learn! Ps jictions is considered se ee aie refers to the dataset In the form ready for your machine learning task Data sources have : i gil rnd put into 2 tabular form. Data has been aggregated and summarized to the right see the dataset represents a unique customer, and each column represents aranularity-for example, e260 SO™ -. nates Deep Learning (SPU) = information for the customer, like the total spent in s ‘9 ax fast a, is present. relevant columns have been dropped, and invalid records have been fitereq out. Engineered features : This refers to the dataset with the tuned features expected by the model—that i, performing certain machine leaming specific operations on the columns in the prepared dataset, and creating ney features for your model during training and prediction. Some of the common examples are scaling numerical columns to a value between 0 and 1, clipping values, and one-hot-encoding categorical features. In practice, data from the same source is often at different stages of readiness. For example, a field from a table in your data warehouse could be used directly as an engineered feature. At the same time, another field in the same table might need to go through transformations before becoming an engineered feature. Similarly, data ‘engineering and feature engineering operations might be combined in the same data pre-processing step. ring and feature engineering tasks. ‘The Fig. 5.2.5, highlights the placement of data en; Feature Engineering Fig. 5.1.5 5.2 _ Introduction to Representation Learning * Now that you understand how features are used to develop machine learning models and why feature engineering is required, let’s resume our discussion on representation learning. What is a Representation? * Assume that you come across a dish by the name "chicken tortilla soup”. 
As soon as you read the dish name, in your mind, you had some sort of representation about what the’ words chicken tortilla soup ‘could 'mean, even though there was no mention of the ‘soup’s particular taste, ‘texture; toppings, ‘serving size, appearance, temperature, ingredients, molecular composition, or any other specifics, The 21 letters-and, spaces making uP “chicken tortilla soup" conveyed sufficient information for you to get a fair idea about what could it look and Possibly taste like, * Obviously, those 21 characters are not the soup itself that you can consume and tell what it tastes lke, They are 4 ‘epresentation of the abstract concept of a chicken tortilla soup. The representation ‘itself ‘consists only of meaningless symbols like "c’ and "h" from the Latin alphabet, organised according to the conventions of written English. The concept is an abstract one because it does not contain information specific to concrete examples of the soup -- taste, texture, toppings, etc, Deep Learning (SPPU) 57 Fig. 5.2.3 The representation makes sen representations constructed fort YOU! Only because iti coherent with the pattems of organisation of other ie mike Symbols and conventions. Combinations of the same symbols that are mies Of written English, such as “ickchenk orttaall pous” and “nekcihe ngish : same letters that chicken tortila soup hag NY Son" fePresetanyting (eventhough they have the exact If you search online for allitrot puos* are meaningless in a n resentations too, made up of the same meaningless symbols from the Latin alphabet, organised into complicated patterns of paragraphs ieee the conventions of the language. Sernces, ubaen, andl words scconeng t You could use a different set of conventions for constructing representations (eg, the conventions of French, German, or Spanish) or even a different set of symbols (e.g, logograms from Chinese Hanzi, or dots and dashes from Morse Code), but the representation of a chicken tortilla soup must be coherent with the patterns’ of organisation of all other representations constructed from the same symbols and conventions. The entire structure of written language, as it were, rests on those patterns of organisation. Indeed, those patterns must be coherent with each other for us to understand written language. This is true not only for all written languages you read with our eyes (or fingertips, in the case of Braille) but also for all other kinds of representations’ you perceive through your senses: For example, the waves of changing air pressure that reach your ears|when someone nearby says “chicken tortila Soup" must be coherent with the patterns of organisation of other sounds you hear 8thervse, you would not understand the spoken language. Similarly, the many millions of photons that hit your retinas every second you look ata chicken tortila soup must be coherent with the patterns of organisation of the. many other millions of photons that reach your eyes after bouncing off other visible objects; otherwise, you would not understand what you.see, it: must be. true for everything you think too. The neuron, activations in your brit that represent the thought "chicken tortilla soup” must be coherent with the pattems of organisation (of other neuron activations in your brain; i ele to think ofthe Soup. Hence you shouldbe able to use ather symbols instead of otherwise, you would not sent your thoughts. The notion might seem farfatched, but in fact thats precisely ee ET net symbols of written language. You represent thoughts with them! 
what you do wit ‘What is Representation Learning? ing" (deep or shallow it means machine fearing in which the goal is to learn arni © When you say, "representation 16 ew representation that retains information (features) aU inal representation to @ pr to transform data from its orig jhile interest to you. I essential to objects that are of er you tanstorm ve into a new representat discarding other information (features) The transformation is ‘neuron activations in your brains, representing a nearby analogous to the,manner in tion consisting of 21 characters in written English, The new Uy estaurant's chicken tortilla so ee Oe Pr Representation Leaning Deep Learning (SPPU) sntation in written English retains the essential features of icken tortilla soup (the abject of interest) ug representation it sr ther information ncuding deal spec to the nearby restaurant's particular version ofthe soup, dis is inf - scar tions in a computer consist of sequences of binary digits, orbits, typically organised into floating-poig Geen cies of bits, digital ones and zeros, as symbols for the construction of representations insg oe mers is not important for your purposes. What Is Important is that the new representations, each ong consitng of a sequence of bits, must be coherent in the sense explained earlier with the patterns of organisation other bit representations of input data transformed in the same manner. see ‘A possible representation of "chicken tortila soup" TOMO PIOOIOPIOAOOOIDIDIOG -«- pooper Ole 7 emer Meaningless symblois used to construct representations inside a computer Fig. 5.2.2 The entire structure of the new representation scheme, as it were, rests on the patterns of organisation of all possible sequences of bits representing different objects in the new scheme. Indeed, those patterns must be ‘coherent with each other for there to be a representation scheme. Need for Representation Leaning The performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. For that reason, much of the actual effort in deploying machine leaming algorithms goes into the design of pre-processing pipelines and data’ transformations that result in a representation of the data that can support effective machine leaming. Such feature engineering is important but labour-intensive and highlights the weakness of current learning algorithms that is their inability to extract and ‘organise the discriminative and useful information from the data. Feature engineering is a way to take advantage of human creativity and prior knowledge to compensate for that weakness. In order to expand the scope and ease of applicability of machine learning, itis highly desirable to make learning algorithms less dependent on feature engineering, so that novel applications could be constructed faster. Wouldn't it be nice if you could automatically discover from the dataset what features are important? In representation learning, data is sent into the machine, and it learns the representation on its own. It is a way of determining a data representation of the features, the distance function, and the similarity function that determines how the predictive model will perform. Representation learning works by reducing high-dimensional data to low-dimensional data, making it easier to discover patterns and anomalies while also providing a better understanding of the data’s overall behaviour. 
How Deep Neural Networks Learn Representations (How Representation Learning Works) ‘As you know, you train deep learning neural networks to predict something for a given new input. Any deep neural Network learns to make predictions via an iterative training process, during which you repeatedly feed sample input data and gradually adjust the behaviour of all layers of neurons in the neural network so that they jointly leam to transform input data into good predictions. You get to decide what those predictions should be and also specify precisely how to measure their goodness to induce learning. each: 39 ati layer of represer it eur tation is itself an inpuy enon (8M to transform is 'PUt to subsequent {ts Inputs into: a diferent representation: This representations, Eventuall 2) 1. you ‘ predictions (however you had reach the final neo in turn, lear to transform it into yet other q learns your layer, in effec, reference inthe folowing cers to cp thre Lye’ fo ee ede. Ths ener pan lagram, turing training ‘Sample rele | = Prada, Denes 7 output Each layer lea saat amare ae Fig. 5.2.3 So, how can you make a d leep neural network lea ; = you discard the predicti IM representations and output that instead of predictions? Simple outputs replscernations eu just keep the representations that it leamt during the course of training, It cute renerematons ised of edicts when peed wh ep dt Fr eal ets at you retwork that has already learned tech a ta auitted ly leamed to make good predictions (however 9 training). You then slice off its last layer, as shown in the following figure, and that's aaa eh with a slightly shorter neural network, which, wh iy abstract representation of it instead whi hy jch, when fed input data, outputs an ion of it it Deep neural net (already trained) In this example, we discard tte original final layer that F OD ‘outputs predictions ‘Our new final layer outputs ‘abstract representations Fig. 5.2.4 You can use the representations produced by this, shorter neural network for a purpose different than, the prediction objectives you specified to induce learning in the training phase. For example, you could feed the representations as inputs to a second neural network for a different learning task. The second neural network would benefit from the representational knowledge learned ‘by the first one assuming that the prediction objectives you specified to train the first neural network induced learning of representations useful for the second one. If the main or only purpose of trainin prediction objectives are more accurate representation learning. The training objectives you ‘a neural net (eg. number, sizes, types, " representations the neural met leas. Your job is to come uP wit! that together induce learning of the kinds of seeps ee ici se, but to learn useful representations of input dst2- predictions per s& — 9 2 neural network isto learn useful representations, then the discarded ely described as training objectives. This is the essence of deep pecify to induce learning in the training phase, along withthe and arrangement of layers), determine what kinds of Lh training objectives and suitable architectures want as a by-product. The goal is not to make architecture of Representation Learning Deep Learning (SPPU) 5-10 Example — Learning Representations of Objects in Images + Let's take a simple example of representation learning to un ‘autoencoder. As you know, autoencoders are artificial neural networks ¢ of the input data, called codings, without any supervision. 
You train au ‘autoencoder but drop the decoding part so as to only get codings that are diagram illustrates how you can use autoencoders for representation learning Middle layer ‘tor Training: x by Predicting Reconstructions tand how it works. Assume that YoU have ay ot apable of learing efficient representation, roencoders as you would do for a regu, the representations, The following During Training: ‘Sample ‘input icaie layer Predicted Desired i ee mH Hl ew (Compression) (Generation) Weaeene ined for generating images — seers rs ‘mos comerestnd reposetaion re miler SS Fig. 5.2.5 + When you train an autoencoder with a large number of examples (say, a few million digital photos), it leams to compress input images into small representations in the middle layer that encode patterns of organisation of the Portrayed objects necessary for regenerating close approximations of those images. After training, you discard all layers following the middle one (codings). You are then left with a neural network that transforms images with millions of pixels into a small number of values that represent the objects in those images. Why and When is Deep Representation Learning Necessary? * There are three major reasons to use deep representation leaming as following. 1. Ifyou don't have enough training data. 2. Ifyou have zero examples for many categories of the objects of interest. 3. If your problem requires a model more computationally complex than feasible. + all cases, it's assumed thatthe problem at hand is complicated enough to require deep learning. What Makes a Representation Good? +The following priors (factors that are desired/assumed to be ze Present) play a ke i representation by learning a function f that maps input x to output Teas fy pe exten a good ‘or more ofthese priors to learn to output representations suited toa specific task. ere shared) across examples. atures are reused (achieved by parameters being Deep Learning (SPPU) - e shih ‘Organisation of explanatory factors + The concepts that are useful for describing the world hiera en be defined in terms of other concepts, in a hierarchy, with more abstract concepts higher in the rey. in terms of less abstract ones. This assumption is exploited with deep representations. 4)” Semi-supervised learning = with inputs x and y to predict, » subset ofthe factors explaining x’ distribution lin much of y, given x. Hence representations that are useful for p(t) tend to be useful when learning Ply x) allowing sharing of statistical strength between the unsupervised and supervised learning tasks. 5) Shared factors across tasks -+ With many y's of interest or many learning tasks in genera, tasks (eg. the corresponding ply | x, task) are explained by factors that are shared with other tasks, allowing sharing of statistical strengths across tasks. ©) Manifolds — Probability mass concentrates near regions that have a much smaller dimensionality than the ‘original space where the data lives, This is explicitly exploited in some of the autoencoder algorithms and ‘other manifold-inspired algorithms, 7) Natural clustering — Different values of categorical variables such as object classes (e.g. cats, dogs) tend to be associated with separate manifolds. Each manifold is composed of learned representation of an object class (say dog, cat). So moving along a manifold tends to preserve the value of a category (e.g. variations of dog ‘when moving on the “dog* manifold). 
Interpolating across object classes would require going through a low density region separating the manifolds. In essence, manifolds representing object classes tend not to overlap much. This factor is exploited in machine learning. 8) Temporal and spatial coherence —» Identifying slowly moving or changing features in temporal/spatial data could be used as a means to learn useful representations, Even though different features change at different spatial and temporal scales, the values of the categorical variables of interest tend to change slow. So this prior can be used as a mechanism to force the representations to change slowly, penalising change in values of categorical variables over time or space. 9) Sparsity + For a given observation x, only a small set of possible features are relevant. This could captured in the representation by features that are often zero or by the fact the extracted features are insensitive to variations of x. Sparse autoencoders use this prior in the form of a regularisation of the representation. 10) Simplicity of Factor Dependencies —> Ifa representation is abstract enough, the features may relate to each other through simple linear dependencies. This can be seen in many laws of physic, and ths is the prior that is assumed when stacking a simple linear predictor on top of a learned representation that is rich and abstract enough. 5.3 Greedy Layer-wise Pre-training arming played a Key historical role inthe revival of deep neural networks, enabling researchers for © Unsupervised le etwork without requiring architectural specialisations like convolution or the first time to train a deep supervised n ; recurrence. You can call this procedure unsupervised pre-training, or more precisely, greedy layer-wise insuperized pre-training. This proceaure isan established example of how a representation learned for one task rer i trying to capture the shape ofthe input astibution) can sometimes be useful for another (unsuy task (supervised learning with the same input domain)
