Understanding the flow of information through a GRU cell
1. Introduction:
Understanding the working mechanism of any neural network architecture requires knowing how information is processed and flows through the network. This article explains the flow of information through GRU cells.
This article covers the following:
- GRU architecture
- Flow of information through the GRU cell
2. Gated Recurrent Unit - An Introduction:
Gated Recurrent Unit (GRU) is a class of deep recurrent neural networks (RNNs) with capabilities very similar to those of Long Short-Term Memory (LSTM) networks. What makes both networks special is that they can efficiently learn "temporal dependencies" over long distances, which standard RNNs cannot. These networks avoid the "vanishing gradient" problem that restricts learning of long-distance temporal dependencies.
They are best suited to domains with an inherent temporal dependency, where the order between elements matters most. Common examples include stock market prediction, decoding speech signals, and the study of biological sequences such as DNA or protein sequences.
A GRU network is a long chain of recurrent units called "GRU cells". Compared to LSTM networks, GRUs are composed of only two gates, namely (i) the reset gate and (ii) the update gate, and contain no buffer element/cell state; this helps reduce the complexity of the architecture.
It is important to note that these gates are nothing but neural network layers with specific activation functions, designed to process a sequence of information step by step, much as a human does.
Assume that at any time step the dimension of the input vector x_t to the GRU cell is 32, while that of the hidden state h_(t-1) is 70, as shown in Figure 1. The concatenation of h_(t-1) and x_t generates a vector X of size 102. This is the vector that is fed to both gates. Note that these values are for demonstration purposes only.
X = concat[h_(t-1), x_t]
The initial condition of the GRU cell is shown in Figure 1.
It is interesting to note that the size of the hidden-state vector h_t at any time step t depends on a hyper-parameter, the "number of neurons" in the layer inside each gate. The layers within the gates are discussed next.
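Below is a minimal NumPy sketch of these dimensions, assuming the illustrative sizes above (32-dimensional input, 70-dimensional hidden state); the variable names x_t, h_prev, and X are only for demonstration and not part of any library.

```python
import numpy as np

input_dim, hidden_dim = 32, 70

x_t = np.random.randn(input_dim)      # current input vector, size 32
h_prev = np.random.randn(hidden_dim)  # previous hidden state h_(t-1), size 70

# Concatenation of the previous hidden state with the current input
X = np.concatenate([h_prev, x_t])     # size 70 + 32 = 102
print(X.shape)                        # (102,)
```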
3. Reset Gate:
The reset gate helps determine how much of the past knowledge to forget.
It is composed of a single neural network layer with a sigmoid activation function, which acts as a filter and produces values in the range [0, 1]. Assume the "number of neurons" in this layer (shown as the light-blue box) is 70, which is the dimension of the hidden-state vector at each time step.
Each output value indicates, element-wise, how much of the past information to forget.
Mathematically, it is described as:
r_t = sigmoid(W_r * X + b_r)
Here, W_r denotes the weight matrix (70 x 102) and b_r denotes the bias vector (70 x 1), with the subscript r indicating the reset gate.
The output from the reset gate is a vector of size 70.
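As a rough sketch, the reset gate can be written in NumPy as follows; the weights W_r and b_r are random placeholders here, purely to show the shapes involved.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

hidden_dim, concat_dim = 70, 102
X = np.random.randn(concat_dim)                # concatenation of h_(t-1) and x_t

W_r = np.random.randn(hidden_dim, concat_dim)  # 70 x 102 weight matrix
b_r = np.zeros(hidden_dim)                     # bias vector of size 70

r_t = sigmoid(W_r @ X + b_r)                   # each entry lies in (0, 1)
print(r_t.shape)                               # (70,)
```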
4. Update Gate:
The update gate helps decide how much of the past information to carry forward. This is an important step: carrying information forward from the past reduces the risk of the vanishing gradient problem.
Mathematically, it is described as:
z_t = sigmoid(W_z * X + b_z)
Here, W_z denotes the weight matrix (70 x 102) and b_z denotes the bias vector (70 x 1), with the subscript z indicating the update gate.
The mathematical expression is very similar to that of the reset gate. However, the way the outputs of these two gates are used marks the difference between them.
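A corresponding NumPy sketch of the update gate is shown below; as before, W_z and b_z are illustrative random values rather than trained weights.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

hidden_dim, concat_dim = 70, 102
X = np.random.randn(concat_dim)                # concatenation of h_(t-1) and x_t

W_z = np.random.randn(hidden_dim, concat_dim)  # 70 x 102 weight matrix
b_z = np.zeros(hidden_dim)                     # bias vector of size 70

z_t = sigmoid(W_z @ X + b_z)                   # vector of size 70, values in (0, 1)
```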
5. Current Memory Content:
Up to this point, the output from the reset gate is used to decide upon the "current memory content": the irrelevant past information is forgotten, and the current input, x_t, is then used to generate the current memory content, h_t`.
Mathematically, it is described as:
h_t` = tanh(W * [r_t ⊙ h_(t-1), x_t] + b)
Here, ⊙ denotes element-wise multiplication, W denotes the weight matrix (70 x 102), and b denotes the bias vector (70 x 1). The product r_t ⊙ h_(t-1) suppresses the parts of the past information that the reset gate marks as irrelevant before the result is concatenated with x_t. The hidden state h_t` is a vector of size 70.
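A minimal NumPy sketch of this step, assuming r_t has already been computed by the reset gate; all parameter values are random placeholders.

```python
import numpy as np

input_dim, hidden_dim = 32, 70

x_t = np.random.randn(input_dim)
h_prev = np.random.randn(hidden_dim)           # previous hidden state h_(t-1)
r_t = np.random.rand(hidden_dim)               # stand-in for the reset gate output

W = np.random.randn(hidden_dim, hidden_dim + input_dim)  # 70 x 102 weight matrix
b = np.zeros(hidden_dim)                                 # bias vector of size 70

# Scale the past state element-wise by the reset gate, then concatenate with x_t
X_reset = np.concatenate([r_t * h_prev, x_t])
h_candidate = np.tanh(W @ X_reset + b)         # current memory content h_t`, size 70
```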
6. Final Hidden State:
The final hidden-state vector of each GRU cell is obtained from the previous hidden state, the update gate, and the current memory content.
The update gate, z_t, decides what to collect from the previous time step, h_(t-1), and what from the current memory content, h_t`. This is done using the following equation:
h_t = z_t ⊙ h_(t-1) + (1 - z_t) ⊙ h_t`
If the current information is most important, then (1 - z_t) is high, emphasizing the current memory content and neglecting most of the past information.
Otherwise, if the information from the past is important, then z_t is high to carry the past information forward.
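Putting the pieces together, the sketch below runs one full GRU-cell step with the shapes used in this article; it follows this article's convention of weighting the past state by z_t and the current memory content by (1 - z_t), and all parameters are random placeholders rather than trained weights.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_cell_step(x_t, h_prev, W_r, b_r, W_z, b_z, W, b):
    X = np.concatenate([h_prev, x_t])               # size 102
    r_t = sigmoid(W_r @ X + b_r)                    # reset gate, size 70
    z_t = sigmoid(W_z @ X + b_z)                    # update gate, size 70
    X_reset = np.concatenate([r_t * h_prev, x_t])   # reset-scaled past + current input
    h_candidate = np.tanh(W @ X_reset + b)          # current memory content h_t`
    return z_t * h_prev + (1.0 - z_t) * h_candidate # final hidden state h_t

input_dim, hidden_dim = 32, 70
concat_dim = input_dim + hidden_dim

W_r, W_z, W = (np.random.randn(hidden_dim, concat_dim) for _ in range(3))
b_r, b_z, b = (np.zeros(hidden_dim) for _ in range(3))

x_t = np.random.randn(input_dim)
h_prev = np.zeros(hidden_dim)                       # initial hidden state

h_t = gru_cell_step(x_t, h_prev, W_r, b_r, W_z, b_z, W, b)
print(h_t.shape)                                    # (70,)
```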
7. Conclusion:
This post discussed the flow of information, in terms of vectors, through a GRU cell.