LSTM for Biological Sequences

Ashish Ranjan
4 min read · Jan 18, 2020

Understanding how LSTM networks are applied to the analysis of biological sequences.

1. Introduction

Biological sequences such as DNA and protein sequences are variable-length strings of characters, and they play a vital role in the study of living organisms. This post discusses the application of recurrent neural networks (RNNs), specifically long short-term memory (LSTM) networks, to the analysis of protein sequences. In recent years, LSTM networks have been widely used to study such sequences.

2. The natural language processing (NLP) view of protein sequences

Protein sequences are “strings of amino acids.” The order of the amino acids is widely acknowledged to be strongly linked to a protein's physical structure, which is essential for its existence.

Protein sequences are commonly treated as analogous to plain-text sentences, which are “strings of words.” Just as the order of amino acids matters in a protein, the order of words in a sentence is crucial to its meaning.

LSTM networks, which are frequently used for NLP tasks, have also gained widespread attention for the study of protein sequences.

In the paper [1], an LSTM network is used to model protein sequences in what is essentially a three-step procedure:

  1. Protein sequence segmentation
  2. Protein segment vector construction
  3. Global protein vector construction

3. Protein sequence segmentation

In the paper [1], the authors discuss the ill effects of modelling complete protein sequences with an LSTM network. (For more detail on LSTM networks, see the post in [2].) Most often, only a part of a protein sequence, called the conserved part, is responsible for mapping it to a particular function. Modelling such sequences with an LSTM network therefore becomes difficult, because the non-conserved part dominates the relatively small conserved part. LSTM networks also have difficulty remembering very long protein sequences and are affected by high variation in sequence length.

Figure: Protein Sequence Segmentation

The figure above shows the segmentation of a protein sequence, with overlap between adjacent segments. Protein sequence P(i) is partitioned into a set of fixed-size segments, represented as φ(i) = {p(i, 1), p(i, 2), …, p(i, j), …}.
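The overlapping segmentation described above can be sketched in a few lines of Python. This is only an illustration: the function name `segment_sequence`, the `overlap` parameter, and the handling of the trailing residues and of sequences shorter than one segment are assumptions, not details taken from the paper.

```python
def segment_sequence(sequence, segment_size, overlap):
    """Split a sequence into fixed-size segments that share `overlap` residues.

    Assumption: a final window anchored at the end of the sequence is added
    when the regular windows do not reach the last residue.
    """
    if not 0 <= overlap < segment_size:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_size")
    if len(sequence) <= segment_size:
        # Assumption: short sequences are kept whole (and padded later).
        return [sequence]

    step = segment_size - overlap  # distance between consecutive segment starts
    last_start = len(sequence) - segment_size
    segments = [sequence[start:start + segment_size]
                for start in range(0, last_start + 1, step)]
    if last_start % step != 0:
        # The regular windows stopped short of the end: cover the tail.
        segments.append(sequence[-segment_size:])
    return segments


# Toy sequence of 16 residues, segment size 6, overlap of 2 residues.
print(segment_sequence("MKTAYIAKQRQISFVK", segment_size=6, overlap=2))
```

With these toy values, adjacent segments share two residues and every segment has the same length, which is what the downstream LSTM expects.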

Advantages of sequence segmentation:

  1. Reduced ill effect of long sequences: Even a highly sophisticated LSTM network has limited ability to remember long protein sequences. Segmentation therefore fragments the protein sequence into small segments, which lets the LSTM network learn local sequence segments much more efficiently than it could learn the full, relatively long sequence.
  2. Data augmentation: Segmentation also increases the size of the training data set, since each protein contributes several segments.
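To make the data augmentation effect concrete, here is a small sketch in which every segment becomes its own training example. It assumes that each segment inherits the labels of its parent protein; that labelling rule, the helper names `windows` and `augment`, and the toy sequences and labels are all illustrative assumptions.

```python
def windows(sequence, size, step):
    """Fixed-size windows taken every `step` residues (trailing remainder dropped)."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, step)]


def augment(dataset, size, step):
    """Turn (sequence, labels) pairs into (segment, labels) pairs.

    Assumption: every segment is labelled with its parent protein's labels.
    """
    examples = []
    for sequence, labels in dataset:
        for segment in windows(sequence, size, step):
            examples.append((segment, labels))
    return examples


# Hypothetical toy data: two proteins with made-up label names.
toy_dataset = [
    ("MKTAYIAKQRQISFVK", {"label_A"}),
    ("GSHMLEDPVAG", {"label_A", "label_B"}),
]
examples = augment(toy_dataset, size=6, step=4)
print(len(toy_dataset), "proteins ->", len(examples), "training segments")
```

The number of training examples grows with sequence length, so long proteins contribute more segments than short ones.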

4. Protein segment vector construction

Each segment is modelled with an LSTM network before the global protein sequence vector is constructed. A neural network architecture called ProtSVG is used to learn the vector for a protein segment. ProtSVG consists of an embedding layer, followed by a bidirectional LSTM layer and a fully connected (dense) output layer. The output layer uses sigmoid activation to handle the multi-label scenario, in which a protein can belong to more than one class at once.
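An embedding layer takes integer indices as input, so each segment first has to be turned into a list of integers. The sketch below shows one common way to do this. The one-residue-per-token scheme, the index reserved for padding, the index for unknown residues, and the name `encode_segment` are assumptions made for illustration; the paper may tokenize differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD = 0                               # assumed index for padding
UNK = len(AMINO_ACIDS) + 1            # assumed index (21) for any other symbol
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def encode_segment(segment, segment_size):
    """Map a segment to `segment_size` integer indices, right-padded with PAD."""
    ids = [INDEX.get(aa, UNK) for aa in segment[:segment_size]]
    return ids + [PAD] * (segment_size - len(ids))


print(encode_segment("MKTAYI", 6))  # a full-length segment
print(encode_segment("MKT", 6))     # a short segment, padded to length 6
```

Under this scheme the embedding layer would need a vocabulary of 22 entries: 20 residues plus the padding and unknown indices.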

The output of the output layer, i.e. the posterior probability for each class, is treated as the vector for segment p(i, j); this segment vector is also written p(i, j). Its size is (1 x K), where K is the number of classes.
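The following sketch illustrates why a sigmoid output suits the multi-label case: each of the K entries is an independent probability, so several classes can be active at once and the entries need not sum to 1. The logit values, K = 4, and the 0.5 threshold are made-up values for illustration only.

```python
import math


def sigmoid(x):
    """Logistic function, mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def segment_vector(logits):
    """Apply sigmoid to each of the K output scores independently."""
    return [sigmoid(z) for z in logits]


# Hypothetical pre-activation scores of the dense layer for K = 4 classes.
logits = [2.0, -1.0, 0.0, 3.0]
vec = segment_vector(logits)

# Assumed decision rule: a class is predicted when its probability is >= 0.5.
predicted = [k for k, p in enumerate(vec) if p >= 0.5]
print([round(p, 3) for p in vec], predicted)
```

The resulting list of K probabilities plays the role of the (1 x K) segment vector.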

Figure: ProtSVG (Protein Segment Vector Generator)

5. Global protein vector construction

The global sequence vector is obtained by averaging the segment vectors p(i, j) of all segments p(i, j) in the set φ(i). The equation used to construct the global vector for a protein sequence is:

P(i, s) = (1 / |φ(i)|) Σ_j p(i, j)

Here, P(i, s) is the global vector for protein sequence P(i) with segment size “s”, |φ(i)| is the number of segments in φ(i), and the sum runs over the segment vectors p(i, j).
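The averaging step can be sketched as an element-wise mean over the segment vectors. The function name `global_protein_vector` and the toy probabilities are assumptions for illustration; the only thing taken from the text is that the global vector is the average of the segment vectors.

```python
def global_protein_vector(segment_vectors):
    """Element-wise mean of equally sized segment vectors (each of length K)."""
    if not segment_vectors:
        raise ValueError("at least one segment vector is required")
    k = len(segment_vectors[0])
    if any(len(v) != k for v in segment_vectors):
        raise ValueError("all segment vectors must have the same length K")
    n = len(segment_vectors)
    return [sum(v[c] for v in segment_vectors) / n for c in range(k)]


# Three hypothetical segment vectors for a protein, with K = 3 classes.
toy_segment_vectors = [
    [0.9, 0.1, 0.5],
    [0.7, 0.3, 0.5],
    [0.8, 0.2, 0.5],
]
global_vec = global_protein_vector(toy_segment_vectors)
print([round(x, 3) for x in global_vec])
```

Because the result always has length K regardless of how many segments a protein produces, proteins of very different lengths end up with vectors of the same size.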

6. Conclusion

The segmentation-based approach works well for protein sequences of highly diverse length. It is shown to handle long protein sequences better without compromising the results for short protein sequences.

References:

  1. Ranjan, A., Fahad, M.S., Fernandez-Baca, D., Deepak, A. and Tripathi, S., 2019. Deep Robust Framework for Protein Function Prediction using Variable-Length Protein Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics, doi: 10.1109/TCBB.2019.2911609.
  2. https://medium.com/@ashish.cse16/understanding-the-flow-of-information-through-lstm-cell-4b8eee2c4c9d
