# Understanding the flow of information through an LSTM cell

**1. Introduction**

Several posts discussing the architecture and benefits of LSTM networks are available, yet understanding how information is processed and flows through an LSTM network remains a complex task. This post discusses the flow of information through the LSTM network, with a particular focus on the size of the vectors that flow through it.

This article covers the following:

- LSTM architecture
- Flow of information through memory cell

**2. LSTM**

Long Short-Term Memory (LSTM) is a class of recurrent neural networks (RNNs); another example is the Gated Recurrent Unit (GRU). LSTMs are capable of learning long-distance temporal dependencies and are well suited to problems involving time-series data, such as stock-market prediction and decoding speech signals, and more recently to the study of biological sequences such as DNA or protein sequences. The use of LSTM networks for modelling protein sequences is shown in the paper by Ranjan et al. (2019), listed at the end of this post.

The LSTM network is a long chain of recurrent units called *memory cells*, also frequently referred to as *LSTM cells*. Basically, the *memory cell* refers to the deep neural network architecture shown in Figure 1. The *memory cell* contains three specialized logic gates, namely (1) *forget logic*, (2) *input logic*, and (3) *output logic*, designed to process a sequence of information in a manner similar to a human being.

It is important to note that all the logic gates are composed of neural network layers whose architecture mimics the human way of processing sequential data.

Additionally, a small buffer element called the *cell state*, represented as *C_t*, is also present in the memory cell. The *cell state* is key to maintaining the global information at each time-step in the chain. As the cell state passes through the memory cell at different time-steps, its contents are modified using the *forget logic* and the *input logic*. This modification depends upon *X*, which is the concatenation of the previous hidden state *h_(t-1)* and the current input *x_t*:

X = [h_(t-1), x_t]

To understand the processing and flow of information through the LSTM cell, assume that the dimension of each input step *x_t* is 32, while that of the hidden state *h_(t-1)* is 70, as shown in Figure 1. In simple words, the input step *x_t* refers to a word in a sentence, where 32 is the embedding dimension for each word.

It is interesting to note that the size of the vectors for the hidden state *h_t* and the cell state *C_t* at any time step *t* actually depends upon a hyper-parameter, i.e., the "**number of neurons**" in each layer of the different gates. The layers within the gates are discussed next.

The concatenation operation between *h_(t-1)* and *x_t* generates a vector *X* of size 102 (70 + 32). This is the size of the vector that is input to all the gates. Note that all the values are chosen for demonstration purposes only.
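
As a quick sanity check, here is a minimal NumPy sketch (not from the original post; 70 and 32 are just the demonstration values used above) showing that the concatenation produces a vector of size 102:

```python
import numpy as np

hidden_size = 70   # dimension of the hidden state h_(t-1)
input_size = 32    # embedding dimension of the input x_t

h_prev = np.zeros(hidden_size)   # previous hidden state h_(t-1)
x_t = np.zeros(input_size)       # current input x_t

# X = [h_(t-1), x_t] -- the vector fed to every gate
X = np.concatenate([h_prev, x_t])
print(X.shape)   # (102,)
```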

**2.1. Forget Gate**

A **forget gate** is used to remove irrelevant information from the *cell state*, *C_(t-1)*, as a new input *x_t* is encountered at the *t*-th time step. The gate is composed of a single neural network layer with a sigmoid activation function, which acts as a filter and produces a value in the range [0, 1] for each element of the cell state. The "**number of neurons**" in the layer is equal to 70, which is the dimension of the vector at each time step.

The value represents the fraction of each element (i.e., of the information) to let through.

Mathematically, it is described as:

f_t = sigmoid(W_f * X + b_f)

Here, *W_f* denotes the weight matrix (70 x 102) and *b_f* denotes the bias vector (70 x 1), with the subscript *f* indicating the forget gate.

The output from the forget gate is a vector of size 70.
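
A small NumPy sketch of the forget-gate equation, using the same demonstration sizes (the random weights below are placeholders, not trained values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(102)        # concatenated [h_(t-1), x_t]
W_f = np.random.randn(70, 102)  # forget-gate weight matrix (70 x 102)
b_f = np.zeros(70)              # forget-gate bias vector (70,)

# f_t = sigmoid(W_f * X + b_f): every entry lies in [0, 1]
f_t = sigmoid(W_f @ X + b_f)
print(f_t.shape)   # (70,)
```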

**2.2. Input Gate**

The **input logic** adds new information to the *cell state*. It is composed of two parallel neural network layers with *sigmoid* and *tanh* activations respectively, as shown in Figure 3. Note that the "**number of neurons**" in both layers is equal to 70. The *sigmoid* layer does the filtering and the *tanh* layer creates a vector of new candidate elements.

Mathematically, it is described as:

i_t = sigmoid(W_i * X + b_i)

C̃_t = tanh(W_c * X + b_c)

where *W_i* and *W_c* denote weight matrices of size (70 x 102). Similarly, *b_i* and *b_c* denote bias vectors of size (70 x 1). The matrix notation is the same as for the forget gate.

The final output from the input gate is the element-wise multiplication between *i_t* and *C̃_t*, which is again a vector of size 70.
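
A corresponding sketch for the input logic, again with placeholder weights of the shapes given above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(102)                             # concatenated [h_(t-1), x_t]
W_i, b_i = np.random.randn(70, 102), np.zeros(70)    # sigmoid layer
W_c, b_c = np.random.randn(70, 102), np.zeros(70)    # tanh layer

i_t = sigmoid(W_i @ X + b_i)       # filter: values in [0, 1]
C_tilde = np.tanh(W_c @ X + b_c)   # candidate values C~_t in [-1, 1]

# contribution of the input gate to the cell state: i_t * C~_t (element-wise)
new_info = i_t * C_tilde
print(new_info.shape)   # (70,)
```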

**2.3. Update Cell State**

The following equation is applied to update the information of the *cell state*, *C_t*, by removing the information filtered out using the *forget logic* and adding the new information produced by the *input logic*. It is given as:

C_t = f_t * C_(t-1) + i_t * C̃_t

The updated cell state *C_t* is a vector of size 70.
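
A sketch of the cell-state update, using placeholder vectors of size 70 in place of the quantities computed above:

```python
import numpy as np

f_t = np.random.rand(70)                  # forget-gate output, values in [0, 1]
i_t = np.random.rand(70)                  # input-gate sigmoid output, values in [0, 1]
C_tilde = np.tanh(np.random.randn(70))    # candidate values C~_t
C_prev = np.random.randn(70)              # previous cell state C_(t-1)

# C_t = f_t * C_(t-1) + i_t * C~_t (element-wise)
C_t = f_t * C_prev + i_t * C_tilde
print(C_t.shape)   # (70,)
```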

**2.4. Output Gate**

The hidden state, *h_t*, at each memory cell is decided based on the updated *cell state*, *C_t*, and the output vector *o_t*. Similar to the layers in the **forget gate** and the **input gate**, here also the "**number of neurons**" is fixed to 70.

The **output logic** is composed of a single neural network layer having a sigmoid function as the non-linear activation. This is shown in Figure 4. The size of the output vector *o_t* is 70. It is described as:

o_t = sigmoid(W_o * X + b_o)

Here, *W_o* and *b_o* denote the weight matrix (70 x 102) and the bias vector (70 x 1) respectively, with the subscript *o* indicating the output gate.

Finally, element-wise multiplication between the output *o_t* and the *tanh* of the cell state *C_t* is carried out to obtain the hidden state vector, *h_t*.

The hidden state *h_t* is a vector of size 70.
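
Finally, a sketch of the output gate and the resulting hidden state, tying the pieces together (placeholder weights and cell state, same demonstration sizes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(102)                             # concatenated [h_(t-1), x_t]
W_o, b_o = np.random.randn(70, 102), np.zeros(70)    # output-gate parameters
C_t = np.random.randn(70)                            # updated cell state from section 2.3

o_t = sigmoid(W_o @ X + b_o)   # output gate, values in [0, 1]

# h_t = o_t * tanh(C_t): the hidden state passed to the next time step
h_t = o_t * np.tanh(C_t)
print(h_t.shape)   # (70,)
```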

**3. Python Implementation**

The Keras-based implementation of the LSTM network accepts the "**number of neurons**" hyper-parameter, which is also referred to as the number of hidden units in the Keras documentation, as seen in the code sketch below.

The model summary reports that the final output from the LSTM layer is a vector of size 140 (2 x 70) due to its bidirectional implementation.
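
Since the original code snippet is not reproduced here, below is a minimal Keras sketch consistent with the numbers used in this post; the vocabulary size and sequence length are made-up values for illustration only:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(10,)),                        # a sequence of 10 word indices (hypothetical length)
    layers.Embedding(input_dim=1000, output_dim=32),  # 32-dimensional word embeddings
    layers.Bidirectional(layers.LSTM(70)),            # 70 hidden units ("number of neurons") per direction
    layers.Dense(1, activation="sigmoid"),
])
model.summary()   # the Bidirectional(LSTM) layer outputs a vector of size 140 (2 x 70)
```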

**4. Conclusion**

This post discussed the flow of information, in terms of vector sizes, through the LSTM cell.

This post was inspired by the following resources:

- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://ashish-cse16.medium.com/understanding-the-flow-of-information-through-gru-cell-198655bc7074
- Ranjan, A., Fahad, M.S., Fernandez-Baca, D., Deepak, A. and Tripathi, S., 2019. Deep Robust Framework for Protein Function Prediction using Variable-Length Protein Sequences. *IEEE/ACM Transactions on Computational Biology and Bioinformatics*.