Structural Based Strategy for Predicting Transcription Factor Binding Sites




Jan 2013


Scanning genomes for potential transcription factor binding sites (TFBSs) is becoming increasingly important in this post-genomic era. The position weight matrix (PWM) is the standard representation of TFBSs used when scanning sequences for potential binding sites. Many transcription factor (TF) motifs are short and highly degenerate, so methods that scan with PWMs are plagued by false positives. Furthermore, many important TFs do not have well-characterized PWMs, which makes identifying their potential binding sites even more difficult. One approach for these TFs has been to use the 3D structure of the TF to predict the DNA structure around it and then to generate a PWM from the predicted 3D complex structure. However, this approach depends on how closely the predicted structure resembles the native structure. We introduce here a novel approach to identifying TFBSs that uses structural information and can be applied to TFs without characterized PWMs, as long as a 3D structure of the TF/DNA complex exists. Our approach uses an energy function that is trained on each structure individually, which leads to increased prediction accuracy and robustness compared with approaches that use a more general energy function. The software is freely available upon request. Please see the supplementary material of the reference publication for details.

Keywords: ChIP-seq, Transcription factor, PDB, Motif, ATAC-seq

Data and Software

  1. TF/DNA structure
    tFIRE takes TF/DNA structures in standard PDB format.
    1. A good place to look for such data is the PDB (Rose et al., 2011) (http://www.pdb.org).
    2. If the TF/DNA complex structure does not exist but the TF structure does, you can generate the TF/DNA structure by docking DNA to the TF structure. Software that can be used for this includes HADDOCK (de Vries et al., 2007) (http://www.nmr.chem.uu.nl/haddock), FTDOCK (Jackson et al., 1998) (http://www.sbg.bio.ic.ac.uk/docking/ftdock.html), YASARA DOCK (http://www.yasara.org/dnadock.htm), and ParaDock (Banitt and Wolfson, 2011) (http://www.paradocks.org). Our method suggests that such docking will not affect the result significantly, but we have not tested any of these docking predictions ourselves. Hence we urge that they be used with caution and that the results be revalidated once the structures become available.
    3. If the TF structure does not exist in 3D structure databases, you can predict it by homology modeling with tools such as SWISS-MODEL (Guex and Peitsch, 1997) (http://swissmodel.expasy.org), Rosetta (Bradley et al., 2003) (https://www.rosettacommons.org), or Sybyl (Visegrády et al., 2001) (http://www.tripos.com/index.php?family=modules,SimplePage,,,&page=SYBYL-X). Please use these with caution, because each prediction step reduces accuracy.
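
Before passing a file to tFIRE, it can help to confirm that the PDB file really contains both a protein chain and a DNA chain. The following is a minimal sketch and is not part of tFIRE: the function names `classify_chains` and `is_tf_dna_complex` are hypothetical, and the sketch assumes the fixed-column ATOM record layout of the PDB format with current DNA residue names (DA, DC, DG, DT). Older files that use single-letter nucleotide names would be reported as "other".

```python
from collections import Counter, defaultdict

# Residue names as they appear in ATOM records of current PDB-format files.
DNA_RESIDUES = {"DA", "DC", "DG", "DT"}
PROTEIN_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def classify_chains(pdb_text):
    """Map each chain ID to 'protein', 'dna' or 'other' by majority of ATOM records."""
    counts = defaultdict(Counter)
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 22:
            continue
        res_name = line[17:20].strip()  # PDB columns 18-20
        chain_id = line[21]             # PDB column 22
        if res_name in DNA_RESIDUES:
            kind = "dna"
        elif res_name in PROTEIN_RESIDUES:
            kind = "protein"
        else:
            kind = "other"
        counts[chain_id][kind] += 1
    return {chain: c.most_common(1)[0][0] for chain, c in counts.items()}


def is_tf_dna_complex(pdb_text):
    """Return True if the file has at least one protein chain and one DNA chain."""
    kinds = set(classify_chains(pdb_text).values())
    return "protein" in kinds and "dna" in kinds
```

A typical call would read the file first, for example `is_tf_dna_complex(open("complex.pdb").read())`, where `complex.pdb` is a placeholder file name.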

  2. Predict TFBSs
    The motif (PWM) predicted by tFIRE can be used to predict TFBSs.
    1. Motif scanning programs can be used to scan the whole genome for motif matches. Such programs include MAST (Bailey et al., 2006) (http://meme.nbcr.net/meme/cgi-bin/mast.cgi) from the MEME suite and STORM (Schones et al., 2007) (http://rulai.cshl.edu/storm) from Cold Spring Harbor Laboratory.
    2. TFBSs vary between cell types. A recently developed method, CENTIPEDE (Pique-Regi et al., 2011) (http://centipede.uchicago.edu), shows that the result of a single DNase-seq experiment can be used to accurately predict TFBSs for all TFs. Downloading DNase-seq data from the ENCODE project can therefore be very helpful.
    3. The recently developed FAIRE-seq technology allows similar detection of accessible chromatin regions (Song et al., 2011). Such data can be substituted for DNase-seq, but this needs to be tested and validated before use.
    4. Epigenetic information can also be used instead of DNase-seq (Cuellar-Partida et al., 2012). We propose to update our data and methods available for such predictions on a regular basis.
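
To illustrate what a motif scanner does with a PWM, here is a minimal log-odds scanning sketch. It is not the algorithm of MAST or STORM, which also compute statistical significance. The uniform background, the pseudocount of 0.01 and the score threshold are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pwm_to_log_odds(pwm, background=None, pseudocount=0.01):
    """Convert columns of base probabilities into log2-odds scores."""
    bg = background or {b: 0.25 for b in BASES}  # assumed uniform background
    scores = []
    for column in pwm:
        total = sum(column[b] + pseudocount for b in BASES)
        scores.append(
            {b: math.log2((column[b] + pseudocount) / total / bg[b]) for b in BASES}
        )
    return scores


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def scan(sequence, log_odds, threshold):
    """Return (start, strand, score) for every window scoring at least threshold."""
    width = len(log_odds)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in COMPLEMENT for b in window):
            continue  # skip windows containing N or other ambiguity codes
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = sum(col[b] for col, b in zip(log_odds, site))
            if score >= threshold:
                hits.append((start, strand, score))
    return hits
```

Hits from such a scan can then be intersected with DNase-seq or FAIRE-seq accessible regions to reduce false positives, which is the idea behind the accessibility-based approaches listed above.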

  3. Software
    1. A C++ software environment, preferably on a Linux system
    2. tFIRE (feel free to ask the author for a Linux version)
    3. WebLogo (Crooks et al., 2004) (http://weblogo.berkeley.edu), which can be used to visualize the predicted PWM
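
A sequence logo draws each PWM position with a total height equal to its information content in bits. The sketch below computes that quantity directly from a PWM, which can be a quick sanity check before plotting. It assumes a four-letter DNA alphabet and omits the small-sample correction that logo tools may apply when given aligned sequences. The function name is hypothetical.

```python
import math


def information_content(pwm):
    """Return the information content (bits) of each PWM column.

    Each column is a dict mapping A, C, G, T to probabilities that sum to 1.
    The maximum for DNA is 2 bits (one base fully conserved).
    """
    bits = []
    for column in pwm:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        bits.append(2.0 - entropy)
    return bits
```

Columns close to 0 bits are essentially uninformative; a PWM made up mostly of such columns will match almost anywhere in a genome.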


Procedure

  1. If you are confident of your TF/DNA complex 3D structure, you can use the tFIRE default energy function, which is pre-trained on all available TF/DNA structures in the PDB. You can also construct your own energy function with tFIRE from several non-homologous structures. To select them, you can use the PISCES server (Wang and Dunbrack, 2003) (http://dunbrack.fccc.edu/PISCES.php), which returns a subset of your input structure list (PDB IDs) in which each protein has little homology to any other.
  2. If you are not confident of your TF/DNA structure, you can train tFIRE with a single structure and subsequently predict PWMs using tFIRE.
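
One common way to turn per-position binding energies into a PWM is a Boltzmann-style conversion, sketched below. This is a generic illustration and not a description of tFIRE's internals; the exact procedure used by tFIRE is given in Xu et al. (2013) and may differ. The input layout (one dict of base energies per position), the function name and the temperature-like parameter `kT` are all assumptions.

```python
import math


def energies_to_pwm(energies, kT=1.0):
    """Convert per-position base energies into probabilities.

    energies: list of dicts mapping A, C, G, T to an energy (lower = more favorable).
    kT: assumed scaling factor in the same units as the energies.
    """
    pwm = []
    for column in energies:
        e_min = min(column.values())  # shift energies for numerical stability
        weights = {b: math.exp(-(e - e_min) / kT) for b, e in column.items()}
        z = sum(weights.values())
        pwm.append({b: w / z for b, w in weights.items()})
    return pwm
```

With this conversion, a position where all four bases have the same energy becomes a uniform column, and a larger energy gap (or a smaller `kT`) gives a more sharply conserved column.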


Acknowledgments

This protocol has been adapted from Xu et al. (2013). We are grateful for funding from the National Sciences Foundation of China (no. 31070641), the National 973 Program of China (no. 2012CB721000), and start-up funding from SKLMRD and DICP, CAS (Chinese Academy of Sciences). The funders covered most of the costs of study design, data collection and analysis, decision to publish, and preparation of the manuscript. We would like to thank Dr. Yan Cui, Dr. Yaoqi Zhou, Dr. Yuedong Yang, Dr. Chi Zhang, Dr. Song Liu, Dr. Jason Donald, Dr. Eugene Shakhnovich, Dr. Timothy Robertson, Dr. Gabriele Varani, Dr. Marc Jung, Dr. Amy Leung, Dr. Rongze Lu and Juan Du for their databases, programs and helpful discussions.


References

  1. Bailey, T. L., Williams, N., Misleh, C. and Li, W. W. (2006). MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res 34(Web Server issue): W369-373.
  2. Banitt, I. and Wolfson, H. J. (2011). ParaDock: a flexible non-specific DNA--rigid protein docking algorithm. Nucleic Acids Res 39(20): e135.
  3. Bradley, P., Chivian, D., Meiler, J., Misura, K. M., Rohl, C. A., Schief, W. R., Wedemeyer, W. J., Schueler-Furman, O., Murphy, P., Schonbrun, J., Strauss, C. E. and Baker, D. (2003). Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins 53 Suppl 6: 457-468.
  4. Crooks, G. E., Hon, G., Chandonia, J. M. and Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res 14(6): 1188-1190.
  5. Cuellar-Partida, G., Buske, F. A., McLeay, R. C., Whitington, T., Noble, W. S. and Bailey, T. L. (2012). Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics 28(1): 56-62. 
  6. de Vries, S. J., van Dijk, A. D., Krzeminski, M., van Dijk, M., Thureau, A., Hsu, V., Wassenaar, T. and Bonvin, A. M. (2007). HADDOCK versus HADDOCK: new features and performance of HADDOCK2.0 on the CAPRI targets. Proteins 69(4): 726-733.
  7. Guex, N. and Peitsch, M. C. (1997). SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 18(15): 2714-2723.
  8. Jackson, R. M., Gabb, H. A. and Sternberg, M. J. (1998). Rapid refinement of protein interfaces incorporating solvation: application to the docking problem. J Mol Biol 276(1): 265-285.
  9. Pique-Regi, R., Degner, J. F., Pai, A. A., Gaffney, D. J., Gilad, Y. and Pritchard, J. K. (2011). Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res 21(3): 447-455.
  10. Rose, P. W., Beran, B., Bi, C., Bluhm, W. F., Dimitropoulos, D., Goodsell, D. S., Prlic, A., Quesada, M., Quinn, G. B., Westbrook, J. D., Young, J., Yukich, B., Zardecki, C., Berman, H. M. and Bourne, P. E. (2011). The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res 39(Database issue): D392-401.
  11. Schones, D. E., Smith, A. D. and Zhang, M. Q. (2007). Statistical significance of cis-regulatory modules. BMC Bioinformatics 8: 19.
  12. Song, L., Zhang, Z., Grasfeder, L. L., Boyle, A. P., Giresi, P. G., Lee, B. K., Sheffield, N. C., Graf, S., Huss, M., Keefe, D., Liu, Z., London, D., McDaniell, R. M., Shibata, Y., Showers, K. A., Simon, J. M., Vales, T., Wang, T., Winter, D., Clarke, N. D., Birney, E., Iyer, V. R., Crawford, G. E., Lieb, J. D. and Furey, T. S. (2011). Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res 21(10): 1757-1767.
  13. Visegrady, B., Than, N. G., Kilar, F., Sumegi, B., Than, G. N. and Bohn, H. (2001). Homology modelling and molecular dynamics studies of human placental tissue protein 13 (galectin-13). Protein Eng 14(11): 875-880.
  14. Wang, G. and Dunbrack, R. L., Jr. (2003). PISCES: a protein sequence culling server. Bioinformatics 19(12): 1589-1591.
  15. Xu, B., Yang, Y., Liang, H. and Zhou, Y. (2009). An all-atom knowledge-based energy function for protein-DNA threading, docking decoy discrimination, and prediction of transcription-factor binding profiles. Proteins 76(3): 718-730.
  16. Xu, B., Schones, D. E., Wang, Y., Liang, H. and Li, G. (2013). A structural-based strategy for recognition of transcription factor binding sites. PLoS One 8(1): e52460.



Copyright: © 2013 The Authors; exclusive licensee Bio-protocol LLC.
Citation: Xu, B., Wang, Y., Liang, H. and Li, G. (2013). Structural Based Strategy for Predicting Transcription Factor Binding Sites. Bio-protocol 3(12): e794. DOI: 10.21769/BioProtoc.794.