Introduction

Transcription factors (TFs) play a critical role in gene expression. By binding to specific regions of the DNA sequence, they control the transmission of genetic information from DNA to messenger RNA1,2,3. Mutations in TF binding sites and their adjacent regions strongly influence gene expression and thereby increase the risk of complex diseases4,5,6,7,8. There is no doubt that a detailed analysis of TF binding is important for the further study of gene expression. Much related work, such as identifying TF binding sites and exploring the effect of locus mutations on TF binding, has been carried out in the laboratory. However, biological experiments for TF binding are expensive and time-consuming. With the development of high-throughput sequencing technology, more and more biological datasets have become available9,10,11. Therefore, a principal task for researchers is to develop computational models that infer the underlying binding rules and identify TF binding sites from these datasets without prior information.

Early computational models of TF binding preference were based on position weight matrices (PWMs) or motifs12,13,14,15,16,17,18. These models do not consider the effect of other sequence features on TF binding, such as low-affinity binding sites, flanking DNA, and sequence GC bias19,20,21,22,23,24. ChIP-seq data, in contrast, contain not only TF binding information but also the aforementioned sequence features. Therefore, many ChIP-seq-based computational models have been proposed and achieve better performance than the earlier models14,20,25,26,27,28,29,30,31,32. For example, unlike the original kmer-SVM, Ghandi et al.33 extracted gapped k-mer features from ChIP-seq data and trained an SVM classifier to predict functional genomic regulatory elements and tissue-specific enhancers; this change greatly improved the prediction performance of the original model. In addition, some computational models combine ChIP-seq and DNase-seq data27,34,35,36,37,38,39,40,41, which improves their prediction accuracy. Moreover, related models have been proposed to identify functional modules from whole-genome sequences42,43, such as promoters44, enhancers45,46, and recombination spots47. For instance, a two-layer predictor45, named ‘iEnhancer-2L’, was developed to identify enhancers and their strength by pseudo k-tuple nucleotide composition. Liu et al.47 proposed an ensemble learning approach to identify recombination spots using multi-modal features obtained from the genome sequence.

Because deep learning48,49 can learn feature representations directly from huge amounts of data, it has developed very rapidly in the past few years and has been widely used in computer vision50, image analysis51,52,53,54,55, speech recognition56, natural language processing (NLP)57, and other fields58,59. Recently, researchers have used deep learning to extract gene regulation information from DNA sequences38,60,61,62,63. For example, Alipanahi et al.28 proposed DeepBind, a model based on deep convolutional neural networks (CNNs), to predict the sequence specificities of DNA- and RNA-binding proteins; it achieved better performance than existing methods. Zeng et al.64 systematically explored the application of CNNs to DNA-protein binding, using a flexible cloud-based framework to rapidly explore alternative neural network architectures. These CNN-based models perform well, but a CNN focuses only on the current state and cannot capture the influence of previous and future states on the current state. To address this problem, Quang et al.65 proposed a hybrid convolutional and recurrent neural network framework for predicting the function of short DNA sequences. Since recurrent neural networks (RNNs)66 can effectively extract feature information from sequential data, they have been widely used for processing sequence data, for example in text classification and video description. In this paper, we use a Bidirectional Gated Recurrent Unit (GRU)67 network to extract feature information from DNA sequences and then use this information to predict TF binding sites.

Neural networks cannot be applied to text directly; the text data must first be transformed into a suitable numerical format68,69,70. Word embedding was proposed to overcome the main drawback of one-hot encoding, which cannot reflect the distributional characteristics of text data. For word embedding, a word corpus, a specific language model and feature learning are prepared first, and then words or phrases from the corpus are mapped to vectors of real numbers71,72,73,74. In this paper, each k-mer is treated as a word in a sentence, so DNA sequences are divided into k-mer series with a specified length and stride window. These k-mer datasets are then fed into a Bidirectional GRU network for feature learning and classification.

Here, we propose KEGRU, a novel computational method for predicting DNA-protein binding sites. In KEGRU, a DNA sequence is first divided into a k-mer sequence with a specified length and stride window. The k-mer sequence is then mapped into a D-dimensional vector space by word2vec. Thirdly, we use a BiGRU to learn features from the k-mer sequences and produce the prediction result. To evaluate the performance of our model, we chose TF binding datasets for four cell lines, H1-hESC, A549, HUVEC and MCF-7, from the Encyclopedia of DNA Elements (ENCODE)75 project. Experimental results show that our model outperforms competing methods on the task of predicting TF binding sites. We show that k-mer embedding is helpful for the transformation of k-mer sequences, and we examine the influence of different k-mer lengths, stride windows and vector dimensions. We also compare the performance of KEGRU with three baseline methods: gkmSVM, DeepBind and CNN_ZH64. We hope that our method will contribute to the study of DNA sequence modeling and DNA regulatory mechanisms.

Results

In this section, we describe the implementation of KEGRU on the Keras platform and a series of experiments performed to evaluate its performance. For simplicity, the CNN-based model proposed by Zeng et al.64 is referred to as CNN_ZH in this paper. We compared our model with gkmSVM, DeepBind, and CNN_ZH, evaluated the efficacy of the k-mer embedding strategy, and assessed the robustness of our model with respect to the k-mer length, stride window, and embedding vector dimension.

AUC (the area under the receiver operating characteristic curve) was used to evaluate the performance of our model. As a common evaluation metric, AUC is widely used in machine learning and motif discovery; it represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. In addition, we used the average precision score (APS), which summarizes a precision-recall curve as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
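As an illustration, both metrics can be computed with scikit-learn. The arrays below are toy placeholders for a model's predicted binding scores and the true labels; this is a minimal sketch, not the evaluation script used in this work.

```python
# Minimal sketch of the two evaluation metrics (toy data, not from the paper).
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = [0, 1, 1, 0, 1]                  # binary labels (binding / non-binding)
y_score = [0.12, 0.83, 0.47, 0.35, 0.91]  # predicted binding probabilities

auc = roc_auc_score(y_true, y_score)            # area under the ROC curve
aps = average_precision_score(y_true, y_score)  # weighted mean of precisions (APS)
print(f"AUC = {auc:.3f}, APS = {aps:.3f}")
```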

Experiment setup

To evaluate the performance of our model, we used 125 TF ChIP-seq experiments from the ENCODE project, covering the A549, MCF-7, H1-hESC and HUVEC cell lines. For each cell type, the central 101 bp of each record in the peak file were taken as a positive sample. An equal number of negative samples was generated by matching the size, GC content and repeat fraction of the positive samples. Each dataset was randomly divided into training, validation and test sets.

For the training of the k-mer embedding model, a k-mer corpus was generated by setting k to 5 and the stride s to 2. We used the Python implementation of the word2vec model in the Gensim package to obtain the k-mer embedding vectors, with all word2vec parameters left at their default values.
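The following is a hedged sketch of this step, assuming Gensim is installed; the toy sequences and the helper comprehension are illustrative, not the original preprocessing code, and min_count is lowered only so the toy corpus keeps every k-mer.

```python
# Sketch of k-mer corpus construction and word2vec training (illustrative only).
from gensim.models import Word2Vec

k, s = 5, 2
sequences = ["ACGTACGTACGTAGCTAGCTA", "TTGACCGGTTAACCGGTTAAC"]   # toy DNA fragments
corpus = [[seq[i:i + k] for i in range(0, len(seq) - k + 1, s)]  # one k-mer "sentence" per sequence
          for seq in sequences]

# word2vec with default parameters, as in the paper; min_count=1 is used here
# only because this toy corpus is too small for the default frequency cutoff.
km = Word2Vec(corpus, min_count=1)
print(km.wv["ACGTA"][:5])   # first entries of the embedding vector for one 5-mer
```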

Hyper-parameter

The hyper-parameter settings in our method fall into two groups: model-related and data-related. The model-related group contains 12 settings, generated by combining different optimizers and numbers of GRU units; details are summarized in Table 1. For each ChIP-seq dataset, we ran all hyper-parameter settings, recorded the performance of each, and selected the best setting by comparing performance across all datasets. The hyper-parameters of the three baseline models, gkmSVM, DeepBind, and CNN_ZH, were left unchanged.

Table 1 Hyper-parameter settings.

The whole process consists of two steps: model training and statistical analysis of the results. During model training, for each ChIP-seq dataset we trained KEGRU under every hyper-parameter setting on the training set with a mini-batch size of 200 and monitored it on the validation set. After training, each trained model was evaluated on the corresponding test set and the test results were recorded. In the statistical analysis, all test results were grouped by cell type, hyper-parameter setting and evaluation metric, because a TF may have different binding properties in different cell types, and a hyper-parameter setting that performs well on one cell type may perform poorly on another. The best hyper-parameter setting for a specific cell type was then obtained by comparing the test results.

Although word embedding has been widely used in NLP, the effect of different embedding strategies on TF binding site prediction is unknown. In addition, the influence of the k-mer length, stride window and embedding vector dimension on model performance is unclear. Therefore, these settings were also treated as hyper-parameters and are discussed in the following sections.

Performance comparison with existing methods

To compare the performance of KEGRU with the three baseline models gkmSVM, DeepBind and CNN_ZH, we performed a series of experiments with different hyper-parameter settings. Figure 1 summarizes the distributions of AUC and APS for the four models on the four cell types under various hyper-parameter settings, highlighting the strong performance of KEGRU. Tables 2 and 3 report the average APS and average AUC of KEGRU and the three baseline models. Based on this information, we chose the best hyper-parameter setting and used scatter plots to display the performance gap between KEGRU and the three baseline models. Figure 2(a) shows that KEGRU outperforms gkmSVM on all four cell lines, and Fig. 2(b,c) show the corresponding comparisons with the other two baseline models. As shown in Fig. 2, KEGRU consistently performs better than the three baselines. In general, with the help of the BiGRU and k-mer embedding, KEGRU achieves higher AUC scores than the three baseline models, suggesting that it can be helpful for the motif discovery task.

Figure 1

The distribution of AUCs and APSs across 125 experiments in the task of DNA-protein binding prediction.

Table 2 Average APS scores across 125 ChIP-Seq datasets.
Table 3 Average AUC scores across 125 ChIP-Seq datasets.
Figure 2

Performance comparison between KEGRU and three baseline models on four cell lines.

Evaluate the effect of k-mer embedding

Although word embedding has proven its efficacy in NLP, it is rarely used in human genomic research76. The role of the embedding layer in our model is to map each k-mer index to the k-mer vector obtained from the pre-trained k-mer model. The previous experiments show that, with k-mer embedding and the BiGRU, KEGRU outperforms the three baseline models, but they do not tell us how much of this effect comes from the embedding strategy itself. To address this question, we designed three embedding strategies: no-init, init-no-train, and init-train. In no-init, the weights of the embedding layer are initialized from a uniform distribution and adjusted by the network during training. In init-no-train, the weights are initialized with the pre-trained k-mer vectors and are not updated during training. In init-train, the weights are initialized with the pre-trained k-mer vectors and fine-tuned during training; this is the strategy adopted in KEGRU.
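A hedged Keras sketch of the three strategies is given below. The vocabulary size, embedding dimension and the random matrix standing in for the pre-trained word2vec vectors are illustrative assumptions, not values from the paper.

```python
# Sketch of the three embedding strategies with a Keras Embedding layer.
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

vocab_size, embed_dim = 4 ** 5 + 1, 100            # all 5-mers plus a padding index (illustrative)
kmer_vectors = np.random.rand(vocab_size, embed_dim)  # placeholder for the pre-trained word2vec matrix

# no-init: random uniform initialization, weights updated during training
emb_no_init = Embedding(vocab_size, embed_dim,
                        embeddings_initializer="uniform", trainable=True)

# init-no-train: pre-trained k-mer vectors, frozen during training
emb_init_no_train = Embedding(vocab_size, embed_dim,
                              embeddings_initializer=Constant(kmer_vectors),
                              trainable=False)

# init-train (the KEGRU setting): pre-trained k-mer vectors, fine-tuned during training
emb_init_train = Embedding(vocab_size, embed_dim,
                           embeddings_initializer=Constant(kmer_vectors),
                           trainable=True)
```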

Figure 3 displays the average APS of the three embedding strategies on the four datasets. The comparison shows that, when pre-trained k-mer vectors are used, fine-tuning has only a minor impact on the performance of our model, while the k-mer embedding strategy itself yields the desired improvement. In conclusion, the pre-trained k-mer vectors reflect the distributional characteristics of k-mers in DNA sequences well and help improve model performance.

Figure 3

Model performance for different embedding strategies.

Sensitivity analysis

In this section, we used the HUVEC dataset to analyze the effect of the k-mer length k, the stride window s and the embedding dimension d. As k increases, the size of the k-mer vocabulary grows exponentially, while a small k cannot capture the characteristics of TF binding. We therefore reconstructed the k-mer corpus with different values of k (from 4 to 6) and obtained the corresponding k-mer embedding vectors by re-training the k-mer model. Figure 4(a) shows that our model achieves similar prediction performance for the different values of k.

Figure 4

Sensitivity analysis of the k-mer length, stride window and embedding vector dimension, performed on the HUVEC dataset. The average and median AUC scores on HUVEC are reported.

The number of k-mers N is determined jointly by the DNA sequence length L, the k-mer length k and the stride window s:

$$N=\lfloor (L-k)/s\rfloor +1$$
(1)

Figure 4(b) shows that the performance of our model decreases as the stride window s increases from 2 to 5. As shown in Equation (1), a larger s reduces the size of the k-mer corpus, which can lead to a loss of useful information and may have a negative impact on the embedding representation. We do not set the stride window to 1, because this would sharply increase the size of the k-mer corpus and make each k-mer overlap largely with its neighbors, which also strongly affects the embedding representation. In general, to make the embedding layer work well, a stride window of s = 2 is recommended.
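For intuition, applying Equation (1) to the 101-bp sequences used here gives the following k-mer counts per sequence; this is a quick numerical check, not code from the paper.

```python
# Number of k-mers per 101-bp sequence for different strides (Equation 1).
L, k = 101, 5
for s in (1, 2, 3, 4, 5):
    N = (L - k) // s + 1
    print(f"s = {s}: N = {N}")   # 97, 49, 33, 25, 20
```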

In addition, we analyzed the effect of the embedding dimension d, considering 50, 100, 150 and 200. A larger d introduces more weight parameters to be learned in the embedding layer and thus increases model complexity. As shown in Fig. 4(c), although changes in the embedding dimension d may lead to overfitting, the prediction accuracy of our model remains fairly constant throughout the experiment.

Discussion

In this paper, we propose a bidirectional gated recurrent unit neural network with k-mer embedding to identify TF binding sites from DNA sequences. The characteristics of our model are summarized as follows. Firstly, word embedding is introduced into our model and applied to k-mer sequence representation using the unsupervised learning algorithm word2vec. To avoid the curse of dimensionality caused by one-hot encoding, the k-mer embedding vectors are used for feature representation, which benefits the subsequent feature learning and classification task. Secondly, KEGRU is well suited to processing variable-length input sequences, whereas CNNs require the input length to be fixed. The BiGRU network not only improves compatibility with variable-length input sequences but also captures complex context information from the k-mer sequence. In addition, we show that our model achieves better prediction performance than the baseline methods and that k-mer embedding is an effective way to improve its performance. Furthermore, we demonstrate the robustness of our model through hyper-parameter experiments and illustrate the role of NLP techniques and RNNs in DNA sequence analysis.

There is still work to be done in the future to improve the performance of our model. First, in double-stranded DNA, domain-specific patterns may appear identically on one strand or its reverse complement, so a GRU with a reverse-complement mechanism would be helpful for feature learning and classification. Second, an attention mechanism is another promising choice. Attention mechanisms have been used in document classification and sentiment classification and have achieved strong performance; their basic idea is to focus on the key parts of the whole sequence, which could help our model highlight the important TF binding sites on DNA sequences. In NLP, there are also existing methods, such as sentence2vec and paragraph2vec, that embed a sentence or a document directly into a vector; a new embedding algorithm could therefore be designed for representing variable-length k-mer sequences. Finally, we hope that our method will contribute to the study of gene regulation mechanisms.

Methods

In this section, we first describe the basic structure of KEGRU. We then discuss the details of word embedding in our model, which is used to represent a k-mer as a low-dimensional vector. Finally, we describe how Bidirectional GRUs are used to capture long-range dependencies and form fixed-length feature representations of arbitrary-length DNA sequences.

Model architecture

Given the k-mer length k and stride window s, a DNA sequence with LD base pairs is split into a k-mer sequence KS of length \({L}_{k}=\lfloor ({L}_{D}-k)/s\rfloor +1\). Each k-mer in KS is indexed by a positive integer in \(K=[1,2,\cdots ,N]\), where N is the size of the k-mer vocabulary. We then need an appropriate approach to learn the co-occurrence information of k-mers in KS, which helps map the k-mer sequence into a vector space V.
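A short sketch of this sequence-to-index step is given below; the helper names and the toy fragment are illustrative, not taken from the KEGRU source.

```python
# Illustrative sketch of splitting a DNA sequence into indexed k-mers.
from itertools import product

def split_kmers(seq, k=5, s=2):
    """Split a DNA sequence of length L_D into floor((L_D - k) / s) + 1 k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, s)]

# Assign every possible 5-mer a positive integer index (0 reserved for padding).
kmer_index = {"".join(p): i + 1 for i, p in enumerate(product("ACGT", repeat=5))}

seq = "ACGTACGTACGTAGCTAGCTA"            # toy DNA fragment
ks = split_kmers(seq)                     # k-mer sequence KS
x = [kmer_index[kmer] for kmer in ks]     # integer indices fed to the embedding layer
```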

In KEGRU, each DNA sequence is given a binary label indicating whether the short DNA sequence is a TF binding region or not. Suppose that we have N labeled instances \({\{{x}_{i},{y}_{i}\}}_{i=1}^{N}\), where \({x}_{i}\in {K}^{{L}_{k}}\) and \({y}_{i}\in \{0,1\}\). Our task is to build a prediction model that predicts the label of each instance. Figure 5 shows the basic structure of KEGRU, and formula (2) represents the entire flow:

$$Y={f}_{pred}({f}_{GRU}({f}_{embed}(x)))$$
(2)

where x denotes a k-mer sequence.
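A minimal Keras sketch of this flow is given below; the layer sizes, vocabulary size and the random matrix standing in for the word2vec vectors are illustrative assumptions, not the tuned settings listed in Table 1.

```python
# Minimal sketch of the KEGRU flow: f_embed -> f_GRU -> f_pred (illustrative sizes).
import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Bidirectional, GRU, Dense
from tensorflow.keras.initializers import Constant

vocab_size, embed_dim, L_k = 4 ** 5 + 1, 100, 49       # 5-mers + padding; 101-bp sequence, stride 2
kmer_vectors = np.random.rand(vocab_size, embed_dim)    # placeholder for the pre-trained word2vec matrix

inputs = Input(shape=(L_k,), dtype="int32")             # integer-indexed k-mer sequence
h = Embedding(vocab_size, embed_dim,
              embeddings_initializer=Constant(kmer_vectors))(inputs)   # f_embed
h = Bidirectional(GRU(50))(h)                           # f_GRU: forward and backward GRUs
outputs = Dense(1, activation="sigmoid")(h)             # f_pred: probability of a binding site
model = Model(inputs, outputs)
model.summary()
```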

Figure 5

The basic architectural structure of our model KEGRU. (1) We first build the k-mer corpus, which consists of the k-mer sequences obtained by splitting the DNA sequences. (2) Based on this k-mer corpus, we use word2vec to learn the k-mer embedding vectors; all k-mer vectors are stacked into the embedding matrix used to initialize the embedding layer. (3) We use a bidirectional GRU network to address the long-range dependency problem and learn feature information from the input k-mer sequence. (4) The prediction results are generated by the dense layer and the sigmoid layer, and a loss function compares the predictions with the true target labels.

Figure 6

Structural comparison between (a) LSTM and (b) GRU67. (a) i, f and o denote the input, forget and output gates, respectively. C and E denote the current memory cell state and the new memory cell state. (b) r and z represent the reset and update gates. h and m are the current unit state and the candidate unit state.

During k-mer embedding, each k-mer is mapped to a vector by learning the co-occurrence statistics of k-mers in the k-mer sequences. We use a Bidirectional GRU network to capture the long-term dependencies and generate a fixed-length feature vector. In the prediction stage, we obtain a prediction by performing a logistic regression on the feature representation. Given a k-mer sequence xi and model parameters Φ, the conditional likelihood of predicting label yi is computed by76:

$$\mathrm{log}\,p({y}_{i}|{x}_{i},{\rm{\Phi }})={y}_{i}\,\mathrm{log}\,\delta ({\beta }^{T}{h}_{i})+(1-{y}_{i})\,\mathrm{log}(1-\delta ({\beta }^{T}{h}_{i}))$$
(3)

where β denotes the prediction parameters, hi represents the learned fixed-length feature representation, and δ denotes the logistic sigmoid function. Our model is trained by minimizing the loss function:

$$\phi =-\,\sum _{i=1}^{N}\mathrm{log}\,p({y}_{i}|{x}_{i},{\rm{\Phi }})$$
(4)
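Equations (3) and (4) together are the binary cross-entropy loss. Continuing the Keras sketch from the model architecture subsection (and using toy random data as placeholders), training then amounts to the following hedged snippet.

```python
# Training under the binary cross-entropy loss of Equations (3)-(4);
# `model` is the sketch defined above, and the data here are toy placeholders.
import numpy as np

x_train = np.random.randint(1, 4 ** 5 + 1, size=(32, 49))   # toy integer-indexed k-mer sequences
y_train = np.random.randint(0, 2, size=(32, 1))             # toy binary labels

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=200, epochs=1)       # mini-batch size 200, as in the paper
```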

K-mer embedding with Word2vec

If one-hot encoding77 is used to encode each word in a large body of text, the word vectors become very high-dimensional, and storing and processing them consumes considerable computing resources. To address this problem, researchers proposed the concept of distributed representation78,79,80,81, including word embedding. Briefly, the core idea behind word embedding is to model and analyze semantic similarities between words based on their distributional properties in large samples of document data. After word embedding, words from the corpus are mapped to vectors of real numbers, which can be processed by a neural network model.

In practical use, all word vectors are stored in a matrix \(WE\in {R}^{d\times N}\), where N denotes the size of the vocabulary and d denotes the word vector dimension. This matrix is called the embedding layer or the lookup table layer. The embedding layer can be initialized by a pre-trained algorithm; such algorithms have been proposed based on neural networks70, dimensionality reduction on the word co-occurrence matrix72, probabilistic models82, and explicit representation in terms of the contexts in which words appear83. For example, word2vec is a set of related models based on continuous bag-of-words and skip-gram; these models are simple neural networks trained to predict the context of a word. GloVe, an unsupervised learning algorithm, instead learns word vectors from the global word-word co-occurrence statistics of a corpus.

In our model, each k-mer in a k-mer sequence is treated as a word in a sentence, so word embedding can be used to represent a k-mer sequence at the word level. Given a k-mer sequence KS consisting of N k-mers, it can be written as \(KS=\{{k}_{1},{k}_{2},{k}_{3},\,\cdots ,\,{k}_{N}\}\). We first train a k-mer model KM with word2vec; the vector of k-mer kt is then obtained from KM:

$$k{v}_{t}=KM({k}_{t})$$
(5)

Then, the k-mer sequence KS can be represented as \(K{S}_{e}=\{k{v}_{1},\,k{v}_{2},\,k{v}_{3},\,\cdots ,\,k{v}_{N}\}\).
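Assuming a Gensim word2vec model has already been trained on the k-mer corpus (as sketched in the Experiment setup section, where it was named `km`), Equation (5) reduces to a simple lookup; the k-mer list below is a toy example.

```python
# Equation (5) as a lookup in the trained k-mer model KM (here the earlier Gensim model `km`).
ks = ["ACGTA", "GTACG", "ACGTA"]        # a toy k-mer sequence KS
ks_e = [km.wv[kmer] for kmer in ks]     # KS_e = {kv_1, kv_2, ..., kv_N}
```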

Bidirectional GRU

To address several shortcomings of the standard RNN model, a number of variants have been proposed, including Long Short-Term Memory (LSTM)84,85 and similar approaches. The GRU was proposed by Cho et al.86; Figure 6 shows the internal structure of the LSTM and the GRU. In a gated recurrent neural network, each unit controls the flow of information through a reset gate and an update gate, and the full memory content is exposed at each time step. The output of a GRU thus balances the previous memory state and the new candidate memory state.

The update gate zt is computed by67

$${z}_{t}=sigmoid({W}_{z}{x}_{t}+{U}_{z}{h}_{t-1}+{b}_{z})$$
(6)

where xt is the input vector of the GRU, ht−1 is the previous output of the GRU, and Wz, Uz and bz are the input weight matrix, recurrent weight matrix and bias of the update gate, respectively.

Similarly to the update gate, the reset gate is computed by67

$${r}_{t}=sigmoid({W}_{r}{x}_{t}+{U}_{r}{h}_{t-1}+{b}_{r})$$
(7)

where Wr, Ur and br are the corresponding parameters of the reset gate.

And then, the candidate memory state mt is computed by67

$${m}_{t}=\,\tanh ({W}_{h}{x}_{t}+{U}_{h}({r}_{t}\ast {h}_{t-1})+{b}_{h})$$
(8)

where tanh is the hyperbolic tangent function and * denotes element-wise multiplication.

Finally, the memory state ht of the GRU is computed by67

$${h}_{t}=(1-{z}_{t}){h}_{t-1}+{z}_{t}{m}_{t}$$
(9)
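For concreteness, Equations (6)-(9) can be transcribed directly into a single GRU step; the sketch below uses random weights and illustrative dimensions rather than learned parameters.

```python
# Direct NumPy transcription of Equations (6)-(9) for one GRU step (illustrative weights).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU update; W, U, b hold the parameter triples for the z, r and h parts."""
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])          # update gate, Eq. (6)
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])          # reset gate, Eq. (7)
    m = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])    # candidate state, Eq. (8)
    return (1.0 - z) * h_prev + z * m                             # new memory state, Eq. (9)

d_in, d_h = 100, 50
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((d_h, d_in)) for g in "zrh"}
U = {g: rng.standard_normal((d_h, d_h)) for g in "zrh"}
b = {g: np.zeros(d_h) for g in "zrh"}
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), W, U, b)   # one hidden state update
```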

To give our model a flexible input format and allow the current state to access future input information, we use a Bidirectional RNN (BiRNN). The basic idea of a BiRNN is that the neurons of a regular RNN are split into a forward layer and a backward layer, which process the sequence in the positive and negative time directions, respectively. With this structure, it is easy to capture the effect of both past and future inputs on the current state. The output of the BiRNN is computed by merging the forward-layer output and the backward-layer output in a specific mode, such as concatenation, sum, average or multiplication.

In our model, the standard RNN units are replaced by GRUs. The output of the BiGRU is calculated by

$$output=merge(fout,\,bout)$$
(10)

where fout is the output of the forward layer and bout is the output of the backward layer.
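In Keras, this merge step is handled by the Bidirectional wrapper via its merge_mode argument ('concat', 'sum', 'ave' or 'mul'); the sketch below uses illustrative input sizes and is not the exact KEGRU configuration.

```python
# The merge of Equation (10) via the Keras Bidirectional wrapper (illustrative sizes).
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Bidirectional, GRU

inputs = Input(shape=(49, 100))                                  # (k-mer positions, embedding dimension)
outputs = Bidirectional(GRU(50), merge_mode="concat")(inputs)    # fout and bout merged by concatenation
model = Model(inputs, outputs)                                   # output dimension is 2 x 50 under 'concat'
```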