Long Short-Term Memory Networks in Python
Vanilla LSTM - A single hidden LSTM layer of memory cells used in a simple network structure
Stacked LSTM - Deep networks built by stacking LSTM layers one on top of another
CNN LSTM - A convolutional neural network learns features from spatial input such as images, and an LSTM uses them to generate sequence output, e.g. textual descriptions of the images
Encoder-Decoder LSTM - Separate LSTM networks that encode input sequences and decode output sequences
Bidirectional LSTM - Input sequences are presented and learned both forward and backward
Generative LSTM - LSTMs learn the structural relationships in input sequences so well that they can generate new plausible output sequences
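As a hedged illustration (the original text includes no code), the vanilla and stacked variants might look roughly like this with the Keras Sequential API; the layer sizes and input shape below are arbitrary assumptions:

    # Assumed: TensorFlow/Keras is available; units and shapes are illustrative only.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    # Vanilla LSTM: a single LSTM layer feeding a dense output layer.
    vanilla = Sequential([
        LSTM(50, input_shape=(10, 1)),  # 10 time steps, 1 feature per step
        Dense(1)
    ])

    # Stacked LSTM: lower layers return the full sequence so the next
    # LSTM layer receives one output per time step.
    stacked = Sequential([
        LSTM(50, return_sequences=True, input_shape=(10, 1)),
        LSTM(50),
        Dense(1)
    ])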
Sequence Prediction Problems
Sequence prediction is a type of supervised learning problem. The sequence imposes an order on the observations that must be preserved when training models and making predictions. We will examine four different types of sequence prediction problems:
1. Sequence Prediction.
2. Sequence Classification.
3. Sequence Generation.
4. Sequence-to-Sequence Prediction.
Often we deal with sets in applied machine learning such as a train or test set of samples. Each sample in the set can be thought of as an observation from the domain. In a set, the order of the observations is not important. A sequence is different. The sequence imposes an explicit order on the observations. The order is important. It must be respected in the formulation of prediction problems that use the sequence data as input or output for the model.
Sequence Prediction
Sequence prediction involves predicting the next value for a given input sequence, e.g. given 1, 2, 3, 4, 5, predict 6.
Examples:
Weather Forecasting – Given a sequence of observations about the weather over time,
predict the expected weather tomorrow.
Stock Market Prediction – Given a sequence of movements of a security over time,
predict the next movement of the security.
Product Recommendation – Given a sequence of past purchases for a customer, predict
the next purchase for a customer.
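To make the framing concrete, here is a minimal sketch of training an LSTM to predict the next value of a toy integer series; Keras is assumed, and the series, window length, and layer sizes are illustrative only:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    # Toy data: from a window of 3 consecutive values, predict the next value.
    series = np.arange(1, 21, dtype=np.float32)
    X = np.array([series[i:i+3] for i in range(len(series) - 3)])
    y = series[3:]
    X = X.reshape((X.shape[0], 3, 1))  # [samples, time steps, features]

    model = Sequential([LSTM(32, input_shape=(3, 1)), Dense(1)])
    model.compile(loss='mse', optimizer='adam')
    model.fit(X, y, epochs=200, verbose=0)

    # Predict the value that follows 18, 19, 20 (output is approximate).
    x_new = np.array([18, 19, 20], dtype=np.float32).reshape(1, 3, 1)
    print(model.predict(x_new, verbose=0))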
Sequence Classification
Sequence classification involves predicting a class label for a given input sequence, e.g. given the input sequence 1, 2, 3, 4, 5, predict the class label "good".
Examples:
DNA Sequence Classification. Given a DNA sequence of A, C, G, and T values,
predict whether the sequence is for a coding or non-coding region.
Anomaly Detection. Given a sequence of observations, predict whether the sequence is
anomalous or not.
Sentiment Analysis. Given a sequence of text such as a review or a tweet, predict
whether the sentiment of the text is positive or negative.
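A minimal sketch of the sequence classification framing, assuming Keras and integer-encoded input sequences; the vocabulary size, layer sizes, and the hypothetical X_train/y_train arrays are assumptions, not part of the original text:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    # Input: integer-encoded sequences, padded to a common length for batching.
    # Output: a single probability for the positive class (binary classification).
    model = Sequential([
        Embedding(input_dim=5000, output_dim=32),  # 5000-word vocabulary assumed
        LSTM(64),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # model.fit(X_train, y_train, epochs=3)  # X_train: [samples, time steps] integer matrix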
Sequence Generation
Sequence generation involves generating a new output sequence that has the same general characteristics as the other sequences in the corpus. For example, given the input sequences 1, 3, 5 and 7, 9, 11, a plausible generated output is 3, 5, 7.
RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words, by making the network treat its inventions as if they were real, much like a person dreaming.
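The iterative sampling procedure described above can be sketched roughly as follows; this assumes a hypothetical trained Keras model that maps a fixed window of token ids to a softmax distribution over the next token:

    import numpy as np

    def generate(model, seed_tokens, n_steps, vocab_size):
        """Iteratively sample from the model's output distribution,
        feeding each sample back in as the next input."""
        sequence = list(seed_tokens)
        window = len(seed_tokens)
        for _ in range(n_steps):
            # Use the most recent window as input (shape: [1, window, 1]).
            x = np.array(sequence[-window:], dtype=np.float32).reshape(1, window, 1)
            probs = model.predict(x, verbose=0)[0]   # distribution over the vocabulary
            probs = probs / probs.sum()              # guard against floating-point drift
            next_token = np.random.choice(vocab_size, p=probs)
            sequence.append(int(next_token))         # treat the sample as if it were real input
        return sequence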
Examples:
Text Generation. Given a corpus of text, generate new sentences or paragraphs of text that read as though they could have been drawn from the corpus.
Handwriting Prediction. Given a corpus of handwriting examples, generate handwriting for new phrases that has the properties of the handwriting in the corpus.
Music Generation. Given a corpus of examples of music, generate new musical pieces
that have the properties of the corpus.
Sequence generation may also refer to the generation of a sequence given a single observation as input. An example is the automatic textual description of images.
Image Caption Generation. Given an image as input, generate a sequence of words that describes the image. For example, given an image of someone riding a motorcycle, the output sequence might be “man riding a motorcycle”.
Sequence-to-Sequence Prediction
Sequence-to-sequence prediction involves predicting an output sequence for a given input sequence, e.g. given the input sequence 1, 2, 3, 4, 5, predict the output sequence 6, 7, 8, 9, 10.
Despite their flexibility and power, [deep neural networks] can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a priori. For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer.
Sequence-to-sequence prediction is a subtle but challenging extension of sequence prediction, where, rather than predicting a single next value in the sequence, a whole new sequence is predicted that may or may not have the same length as the input sequence. This type of problem has recently seen a lot of study in the area of automatic text translation (e.g. translating English to French) and may be referred to by the abbreviation seq2seq. Seq2seq learning, at its core, uses recurrent neural networks to map variable-length input sequences to variable-length output sequences. While relatively new, the seq2seq approach has achieved state-of-the-art results in […] machine translation.
If the input and output sequences are a time series, then the problem may be referred to as multi-step time series forecasting. Some examples of sequence-to-sequence problems include:
Multi-Step Time Series Forecasting. Given a time series of observations, predict a
sequence of observations for a range of future time steps.
Text Summarization. Given a document of text, predict a shorter sequence of text that
describes the salient parts of the source document.
Program Execution. Given the textual description of a program or mathematical equation,
predict the sequence of characters that describes the correct output.
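One common (though not the only) way to realize sequence-to-sequence prediction in Keras is the encoder-decoder pattern sketched below; the RepeatVector/TimeDistributed arrangement and all sizes are illustrative assumptions:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

    n_in, n_out, n_features = 5, 5, 1  # input/output steps and features (example values)

    model = Sequential([
        LSTM(64, input_shape=(n_in, n_features)),  # encoder: compress the input sequence
        RepeatVector(n_out),                       # repeat the encoding once per output step
        LSTM(64, return_sequences=True),           # decoder: produce one vector per output step
        TimeDistributed(Dense(n_features))         # map each step to an output value
    ])
    model.compile(loss='mse', optimizer='adam')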
Limitations of Multilayer Perceptrons
Classical neural networks called Multilayer Perceptrons, or MLPs for short, can be applied to sequence prediction problems. MLPs approximate a mapping function from input variables to output variables. This general capability is valuable for sequence prediction problems (notably time series forecasting) for a number of reasons.
Robust to Noise. Neural networks are robust to noise in input data and in the mapping
function and can even support learning and prediction in the presence of missing values.
Nonlinear. Neural networks do not make strong assumptions about the mapping function and readily learn linear and nonlinear relationships.
More specifically, MLPs can be configured to support an arbitrary but fixed number of inputs and outputs in the mapping function. This means that:
Multivariate Inputs. An arbitrary number of input features can be specified, providing
direct support for multivariate prediction.
Multi-Step Outputs. An arbitrary number of output values can be specified, providing direct support for multi-step and even multivariate prediction.
This capability overcomes the limitations of using classical linear methods (think tools like ARIMA for time series forecasting). For these capabilities alone, feedforward neural networks are widely used for time series forecasting. One important contribution of neural networks is their elegant ability to approximate arbitrary nonlinear functions. This property is of high value in time series processing and promises more powerful applications, especially in the subfield of forecasting …
MLPs have five critical limitations:
Stateless. MLPs learn a fixed function approximation. Any outputs that are conditional on the context of the input sequence must be generalized and frozen into the network weights.
Unaware of Temporal Structure. Time steps are modeled as input features, meaning that the network has no explicit handling or understanding of the temporal structure or order between observations.
Messy Scaling. For problems that require modeling multiple parallel input sequences, the number of input features increases as a factor of the size of the sliding window, without any explicit separation of the time steps of the series.
Fixed Sized Inputs. The size of the sliding window is fixed and must be imposed on all inputs to the network (see the sliding-window sketch after this list).
Fixed Sized Outputs. The size of the output is also fixed and any outputs that do not conform must be forced.
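To illustrate the fixed-window framing that causes these limitations, here is a small sketch of how a series must be flattened into fixed-size rows for an MLP, with time steps becoming ordinary input features; the window length of 3 is an arbitrary choice:

    import numpy as np

    def sliding_window(series, window):
        """Frame a sequence as fixed-size (X, y) rows for an MLP.
        Each time step becomes just another input column."""
        X = np.array([series[i:i+window] for i in range(len(series) - window)])
        y = np.array(series[window:])
        return X, y

    X, y = sliding_window([10, 20, 30, 40, 50, 60], window=3)
    # X = [[10 20 30], [20 30 40], [30 40 50]]
    # y = [40, 50, 60]
    # The window size (3) is fixed: every input, at train and predict time,
    # must be forced into exactly this shape.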
The Long Short-Term Memory Network
The computational unit of the LSTM network is called the memory cell, memory block, or just cell for short. The term neuron as the computational unit is so ingrained when describing MLPs that it too is often used to refer to the LSTM memory cell. LSTM cells are comprised of weights and gates.
The Long Short Term Memory architecture was motivated by an analysis of error
flow in existing RNNs which found that long time lags were inaccessible to existing
architectures, because backpropagated error either blows up or decays exponentially.
An LSTM layer consists of a set of recurrently connected blocks, known as memory
blocks. These blocks can be thought of as a differentiable version of the memory
chips in a digital computer. Each one contains one or more recurrently connected
memory cells and three multiplicative units – the input, output and forget gates –
that provide continuous analogues of write, read and reset operations for the cells.
… The net can only interact with the cells via the gates.
LSTM Weights
A memory cell has weight parameters for the input and output, as well as an internal state that is built up through exposure to input time steps.
Input Weights. Used to weight input for the current time step.
Output Weights. Used to weight the output from the last time step.
Internal State. Internal state used in the calculation of the output for this time step.
LSTM Gates
The key to the memory cell are the gates. These too are weighted functions that further govern the information flow in the cell. There are three gates:
Forget Gate: Decides what information to discard from the cell.
Input Gate: Decides which values from the input to use to update the memory state.
Output Gate: Decides what to output based on input and the memory of the cell.
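A minimal sketch of how these gates combine in the standard LSTM formulation, using plain NumPy; the weight names and dictionary layout are illustrative assumptions, and libraries such as Keras handle all of this internally:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One time step of a standard LSTM cell.
        W, U, b each hold parameters for the four parts: 'f', 'i', 'o', 'g'."""
        f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate: what to discard
        i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate: what to update
        o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate: what to expose
        g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # candidate cell values
        c_t = f * c_prev + i * g       # new internal (cell) state
        h_t = o * np.tanh(c_t)         # output / hidden state for this time step
        return h_t, c_t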
We can summarize the three key benefits of LSTMs as:
Overcomes the technical problems of training an RNN, namely vanishing and exploding gradients.
Possesses memory to overcome the issues of long-term temporal dependency in input sequences.
Processes input sequences and output sequences time step by time step, allowing variable-length inputs and outputs.
Applications of LSTMs
Automatic Image Caption Generation
Automatic Translation of Text
Automatic Handwriting Generation
Limitations of LSTMs
LSTMs are very impressive. The design of the network overcomes the technical challenges of RNNs to deliver on the promise of sequence prediction with neural networks. The applications of LSTMs achieve impressive results on a range of complex sequence prediction problems. But LSTMs may not be ideal for all sequence prediction problems. For example, in time series forecasting, often the information relevant for making a forecast is within a small window of past observations. Often an MLP with a window or a linear model may be a less complex and more suitable model.