
Understanding the flow of information through an LSTM cell

1. Introduction

Ashish Ranjan
6 min read · Nov 2, 2019


Several posts discussing the architecture and benefits of LSTM networks are available, yet understanding how information is processed and flows through an LSTM network remains a difficult task. This post is intended to walk through that flow of information, with particular focus on the size of the vectors that move through the network.

This article covers the following:

  1. LSTM architecture
  2. Flow of information through the memory cell

2. LSTM

Long Short-Term Memory (LSTM) is a class of recurrent neural networks (RNNs), another example being the Gated Recurrent Unit (GRU), capable of learning long-distance temporal dependencies. LSTMs are well suited to problems involving time-series data, such as stock-market prediction and decoding speech signals, and more recently to the study of biological sequences such as DNA or protein sequences. The use of LSTM networks for modelling protein sequences has also been demonstrated in published work.

The LSTM network is a long chain of recurrent units called memory cells, also frequently referred to as LSTM cells. The memory cell is itself a small neural network, as shown in Figure 1. It contains three specialized logic gates, namely, (1) the forget logic, (2) the input logic, and (3) the output logic, designed to process sequential information in a way loosely analogous to how humans process it.

It is important to note that all the logic gates are composed of neural network layers; it is this gating architecture that lets the network selectively remember and forget as it processes sequential data.

Additionally, a small buffer element called the cell state, represented as C_t, is also present in the memory cell. The cell state is the key to maintaining global information across the time steps in the chain. As the cell state passes through the memory cell at each time step, its contents are modified by the forget logic and the input logic. This modification depends upon X, which is the concatenation of the previous hidden state h_(t-1) and the current input x_t:

X = concatenation[ h_(t-1), x_t ]

To understand the processing and flow of information through the LSTM cell, assume that the dimension of each input step x_t is 32, while that of the hidden state h_(t-1) is 70, as shown in Figure 1. In simple words, the input step x_t corresponds to a word in a sentence, where 32 is the embedding dimension of each word.

It is interesting to note that the size of the hidden state h_t and of the cell state C_t at any time step t is determined by the hyper-parameter "number of neurons" used in each layer of the different gates. The layers within the gates are discussed next.

The concatenation of h_(t-1) and x_t produces a vector X of size 102 (70 + 32). This is the size of the vector fed into all the gates. Note that these values are chosen for demonstration purposes only.
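As a quick illustration, here is a minimal NumPy sketch of the concatenation step, using the 32- and 70-dimensional vectors assumed above (the variable names are my own, chosen only for illustration):

    import numpy as np

    embedding_dim = 32   # size of each input step x_t (word embedding)
    hidden_dim = 70      # size of the hidden state h_(t-1)

    x_t = np.random.randn(embedding_dim)    # current input
    h_prev = np.random.randn(hidden_dim)    # previous hidden state

    # Concatenate the previous hidden state and the current input
    X = np.concatenate([h_prev, x_t])
    print(X.shape)   # (102,)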

2.1. Forget Gate

The forget gate is used to remove irrelevant information from the cell state, C_(t-1), as the new input x_t is encountered at the t-th time step. The gate is composed of a single neural network layer with a sigmoid activation function, which acts as a filter and produces a value in the range [0, 1] for each element of the cell state. The "number of neurons" in the layer is equal to 70, which is the dimension of the cell state at each time step.

Each value represents how much of the corresponding element of the cell state (i.e., its information) to let through.

Mathematically, it is described as:

f_t = sigmoid(W_f * X + b_f )

Here, W_f denotes the weight matrix (70 x 102) and b_f denotes the bias vector (70 x 1), with the subscript f indicating the forget gate.

The output from the forget gate is a vector of size 70.
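A rough NumPy sketch of the forget gate follows, with randomly initialised weights used purely for shape checking (they are not trained parameters):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden_dim, input_dim = 70, 32
    X = np.random.randn(hidden_dim + input_dim)                # concatenated [h_(t-1), x_t], size 102

    W_f = np.random.randn(hidden_dim, hidden_dim + input_dim)  # (70 x 102)
    b_f = np.zeros(hidden_dim)                                 # (70,)

    f_t = sigmoid(W_f @ X + b_f)   # each entry lies in (0, 1)
    print(f_t.shape)               # (70,)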

2.2. Input Gate

The input logic adds new information to the cell state. It is composed of two parallel neural network layers with sigmoid and tanh activations, respectively, as shown in Figure 3. Note that the "number of neurons" in both layers is equal to 70. The sigmoid layer acts as a filter, while the tanh layer creates a vector of new candidate values.

Mathematically it is described as:

i_t = sigmoid(W_i * X + b_i)

C̃_t = tanh(W_c * X + b_c)

Here, W_i and W_c denote weight matrices of size (70 x 102). Similarly, b_i and b_c denote bias vectors of size (70 x 1). The notation follows that of the forget gate.

The final output from the input gate is the element-wise multiplication of i_t and C̃_t, which is again a vector of size 70.
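The input gate can be sketched the same way; again, the weights are random placeholders used only to verify the vector sizes:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden_dim, input_dim = 70, 32
    X = np.random.randn(hidden_dim + input_dim)   # concatenated [h_(t-1), x_t]

    W_i = np.random.randn(hidden_dim, hidden_dim + input_dim)
    W_c = np.random.randn(hidden_dim, hidden_dim + input_dim)
    b_i = np.zeros(hidden_dim)
    b_c = np.zeros(hidden_dim)

    i_t = sigmoid(W_i @ X + b_i)       # filter: how much of each candidate to admit
    C_tilde = np.tanh(W_c @ X + b_c)   # candidate values in (-1, 1)

    new_info = i_t * C_tilde           # element-wise product
    print(new_info.shape)              # (70,)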

2.3. Update Cell State

The cell state is updated by removing the information filtered out by the forget logic and adding the new information produced by the input logic. It is given as:

C_t = f_t ⊙ C_(t-1) + i_t ⊙ C̃_t

where ⊙ denotes element-wise multiplication.

The updated cell state C_t is a vector of size 70.
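Continuing the sketch, the update combines the two gate outputs with the previous cell state (the vectors here are random stand-ins for the quantities computed above):

    import numpy as np

    hidden_dim = 70
    C_prev = np.random.randn(hidden_dim)            # previous cell state C_(t-1)
    f_t = np.random.rand(hidden_dim)                # forget gate output, values in (0, 1)
    i_t = np.random.rand(hidden_dim)                # input gate (sigmoid) output
    C_tilde = np.tanh(np.random.randn(hidden_dim))  # candidate values

    # Forget part of the old state, then add the newly admitted information
    C_t = f_t * C_prev + i_t * C_tilde
    print(C_t.shape)   # (70,)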

2.4. Output Gate

The hidden state, h_t, of each memory cell is computed from the updated cell state, C_t, and the output vector o_t. As with the layers in the forget and input gates, the "number of neurons" here is fixed to 70.

The output logic is composed of a single neural network layer with a sigmoid activation, as shown in Figure 4. The size of the output vector o_t is 70. It is described as:

o_t = sigmoid(W_o * X + b_o)

Here, W_o and b_o denote the weight matrix (70 x 102) and the bias vector (70 x 1), respectively, with the subscript o indicating the output gate.

Finally, the updated cell state is passed through a tanh non-linearity and multiplied element-wise with o_t to obtain the hidden state vector:

h_t = o_t ⊙ tanh(C_t)

The hidden state h_t is a vector of size 70.
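A final sketch for the output gate and the hidden state, again with placeholder weights:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    hidden_dim, input_dim = 70, 32
    X = np.random.randn(hidden_dim + input_dim)                # concatenated [h_(t-1), x_t]
    C_t = np.random.randn(hidden_dim)                          # updated cell state

    W_o = np.random.randn(hidden_dim, hidden_dim + input_dim)  # (70 x 102)
    b_o = np.zeros(hidden_dim)

    o_t = sigmoid(W_o @ X + b_o)   # output gate, shape (70,)
    h_t = o_t * np.tanh(C_t)       # hidden state passed to the next time step
    print(h_t.shape)               # (70,)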

3. Python Implementation

The Keras implementation of the LSTM layer accepts the "number of neurons" hyper-parameter (the units argument, referred to as the number of hidden units in the Keras documentation), as seen in the code below.
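A minimal Keras sketch matching the dimensions used above is shown here; the vocabulary size and sequence length are placeholder assumptions, and units=70 is the "number of neurons" hyper-parameter:

    from tensorflow.keras import Input
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Bidirectional

    vocab_size = 10000   # assumed vocabulary size
    seq_length = 50      # assumed sentence length

    model = Sequential([
        Input(shape=(seq_length,)),
        # Each word is mapped to a 32-dimensional embedding (the x_t of size 32)
        Embedding(input_dim=vocab_size, output_dim=32),
        # units=70 hidden neurons; the bidirectional wrapper concatenates the
        # forward and backward outputs, giving 70 + 70 = 140
        Bidirectional(LSTM(units=70)),
    ])

    model.summary()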

The model summary reports that the final output from the LSTM layer is a vector of size 140, due to the bidirectional implementation (70 forward + 70 backward).
