High-resolution and Deep Phylogenetic Reconstruction of Ancestral States from Large Transcriptomic Data Sets

Abstract

Phylogenetics is an important area of evolutionary biology that helps to understand the origin and divergence of genes, genomes and species. Building meaningful phylogenetic trees is needed for the accurate reconstruction of the past. To achieve a correct phylogenetic understanding of genes or proteins, reliable and robust methods are needed to construct meaningful trees. With the rapidly increasing availability of genome and transcriptome sequencing data, there is a need for efficient and accurate methodologies for ancestral state reconstruction. Currently available methods are mostly specific for certain gene families, and require substantial adaptation for their application to other gene families. Hence, a generalized framework is essential to utilize large transcriptome resources such as OneKP and MMETSP. Here, we have developed a flexible yet efficient method, based on core strengths such as emphasis on being inclusive in homolog selection, and defining orthologs based on multi-layered inferences. We illustrate how specific steps can be modified to fit the needs of any protein family under consideration. We also demonstrate the success of this protocol by studying and testing the orthologs in various gene families. Taken together, we present a protocol for reconstructing the ancestral states of various domains and proteins across multiple kingdoms of eukaryotes, using thousands of transcriptomes.

Keywords: Phylogenomics, OneKP, MMETSP, Plants, Phylogenetics, Evolution, Transcriptome

Background

Phylogenetic trees are fundamental to understanding the evolution of genes, gene families, species, phyla and even kingdoms. They help to depict diversity and to resolve differences at various levels. For example, at the protein level, they help us to identify orthologous groups based on amino acid differences across species. Earlier, phylogenetic trees were constructed from a few gene/protein sequences from a limited number of species. With ever-growing sequencing data, as more and more genomes and transcriptomes become accessible, there is tremendous potential for, e.g., the discovery of new lineages, ‘gap-filling’ in phylogenies and hence an improved understanding of biology (Levy and Myers, 2016; Burki et al., 2019).

In the last decade, many efforts have been made towards defining the transcriptomes of hundreds (or even thousands) of species, owing to the popularity of RNA-Seq (Stark et al., 2019). Transcriptomes provide a quick insight into the (expressed) gene content of a genome. Even though an individual transcriptome does not cover the entire gene content of an organism, combining transcriptomes from multiple cells, tissues and conditions may capture the majority of the transcribed genes of that species. Hence, sequencing and assembling a transcriptome is a relatively straightforward approach that requires no a priori knowledge of the genome. Current long-read and single-cell RNA-sequencing technologies make it even easier to build a complete transcriptome (Wang et al., 2016). Utilizing these technological advances, two large transcriptome sequencing projects were developed: the 1000 plant transcriptomes project (OneKP; Carpenter et al., 2019; One Thousand Plant Transcriptomes Initiative, 2019) and the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP; Keeling et al., 2014). OneKP represents the majority of the land plants and algal groups, whereas MMETSP covers the majority of the SAR group and other (unidentified) phyla in Chromista.

Since their inception, diverse approaches have been developed and applied to these transcriptomes to estimate the ancestral states of various genes across multiple classes, families and even phyla (Li et al., 2014; Wickett et al., 2014; Yerramsetty et al., 2016). The majority of these methods focus on one gene family, and need substantial modifications in methodology before they can be applied to other gene families. Moreover, the methods used are neither inclusive nor robust in terms of multi-layered inferences: orthology is inferred from only one line of evidence, such as Best Bi-directional Hit, protein domains, or simple phylogenies based on a few genomes. To overcome these disadvantages, we developed a unified framework to build high-resolution phylogenies that utilize the rich OneKP and MMETSP transcriptome resources. This new method is not only inclusive, but also utilizes multi-layered orthology to interpret phylogenies with high confidence, leading to the identification of new (sub-)classes of orthologs.

Overview of the protocol
The current protocol is developed to reconstruct ancestral states and high-resolution phylogenetic trees of various gene families using transcriptomes and/or proteomes. The ancestral state represents the minimal gene complement at each evolutionary node, where species-specific gene duplications and/or losses would have modified the gene complement in individual species. Hence, selecting the correct sequences, orthologous as well as diverse, is a crucial step in such a deep phylogenetic tree construction. This protocol is built on three core strengths: (1) Inclusive: include more sequences at the start with liberal parameters, and remove sequences while going through the various steps in the pipeline, resulting in a high-quality, logical sequence set for phylogenetic tree construction. (2) Multi-layered: multiple levels of orthology confirmation, i.e., based on the domain architecture, reciprocal BLAST and the phylogenetic tree. (3) Robust: no limitations on the length of the protein or the number of sequences used in the phylogeny, with suggestions on alternative analysis packages tested in various steps. Overall, the protocol comprises 14 steps that are divided into three sections: Homolog identification (Steps 1-5), Ortholog detection (Steps 6-8) and Phylogeny construction (Steps 9-14). All the general parameters and recommendations for the respective steps are indicated below.

Equipment

  1. Linux machine
    Computer set-up: The majority of the programs listed in the Software section run only in a Linux environment; hence, it is recommended to perform the analysis on a Linux machine with access to the BASH shell (terminal). The average time needed to analyze one gene family is 1-1.5 days on a generic Linux workstation with 64 GB RAM and an 8-core processor. The disk space needed for this analysis is less than 1 GB.

Software

  1. tblastn and blastp from BLAST+ module v2.9.0 (Camacho et al., 2009) (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
  2. faSomeRecords: Linux binary from UCSC (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/)
  3. TransDecoder v5.5.0 (Haas et al., 2013) (transdecoder.github.io)
  4. MEME motif discovery v5.1.0 (Bailey et al., 2009) (http://meme-suite.org/)
  5. ScanProsite web-tool (https://prosite.expasy.org/scanprosite)
  6. InterProScan v5.38-76.0 (Jones et al., 2014) (https://github.com/ebi-pf-team/interproscan)
  7. MAFFT v7 (Katoh and Standley, 2013) (https://mafft.cbrc.jp/alignment/software/)
  8. JalView (Waterhouse et al., 2009) (https://www.jalview.org/)
  9. ModelFinder (Kalyaanamoorthy et al., 2017) (accessed as in-built module from IQ-TREE)
  10. ModelTest-NG (Darriba et al., 2020) (https://github.com/ddarriba/modeltest)
  11. PartitionFinder v2 (Lanfear et al., 2012) (http://www.robertlanfear.com/partitionfinder/)
  12. IQ-TREE v1.6.12 (Nguyen et al., 2015) (http://www.iqtree.org)
  13. RAxML v8 (Stamatakis, 2014) (https://cme.h-its.org/exelixis/web/software/raxml/index.html)
  14. PhyML v3.3 (Guindon et al., 2010) (https://github.com/stephaneguindon/phyml)
  15. MrBayes v3.2.7 (Ronquist et al., 2012) (https://github.com/NBISweden/MrBayes)
  16. iTOL v4 (Letunic and Bork, 2019) (https://itol.embl.de)
  17. Linux BASH shell (terminal) ‘cut, sort and uniq’ functions (https://tiswww.case.edu/php/chet/bash/bashref.html)
  18. Scripts used for automating certain steps in the protocol are available through GitHub (https://github.com/sumanthmutte/Phylogenomics)

Data
  1. OneKP dataset (1000 plant transcriptomes project): Contains 1341 transcriptomes from 1179 species covering all the major classes of land plants, green algae, red algae and glaucophytes (Carpenter et al., 2019; One Thousand Plant Transcriptomes Initiative, 2019); http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/oneKP_capstone_2019
  2. MMETSP dataset (Marine Microbial Eukaryote Transcriptome Sequencing Project): Contains 678 transcriptomes from 410 species covering all the major classes of Stramenopila and Alveolata (SAR group) and many unclassified (unicellular) marine eukaryotes (Keeling et al., 2014); https://gold.jgi.doe.gov/study?id=Gs0128947

Procedure

The commands and parameters used in each step of the protocol are given below, with step numbers corresponding to Figure 1. Before starting the protocol, we first created a BLAST database for each transcriptome or proteome. This was carried out only once per transcriptome or proteome using the makeblastdb program, where ‘-in’ takes the FASTA file of the transcriptome or proteome, and ‘-dbtype’ sets the database type: nucl for transcriptomes and prot for proteomes.

$ makeblastdb -dbtype nucl -in transcriptome.fasta

$ makeblastdb -dbtype prot -in proteome.fasta
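When many transcriptomes and proteomes are involved, the database step can be scripted. The following is a minimal Python sketch and not part of the authors' scripts; the function names and example file names are hypothetical. It only assembles the makeblastdb command lines shown above, and executes them when `run=True` (which assumes BLAST+ is on the PATH).

```python
import subprocess

def makeblastdb_command(fasta_path, is_protein):
    """Build the makeblastdb call for one FASTA file (prot or nucl database)."""
    dbtype = "prot" if is_protein else "nucl"
    return ["makeblastdb", "-dbtype", dbtype, "-in", fasta_path]

def build_databases(fasta_files, run=False):
    """fasta_files maps a FASTA path to True (proteome) or False (transcriptome)."""
    commands = [makeblastdb_command(path, is_protein)
                for path, is_protein in sorted(fasta_files.items())]
    if run:  # only executed on request; requires BLAST+ to be installed
        for command in commands:
            subprocess.run(command, check=True)
    return commands

# Hypothetical input files; nothing is executed because run defaults to False.
commands = build_databases({"transcriptome.fasta": False, "proteome.fasta": True})
for command in commands:
    print(" ".join(command))
```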


Figure 1. Methodology schematic showing various steps of the protocol used for ortholog identification and phylogenetic tree construction. Circled numbers correspond to the various steps of the protocol as indicated in the procedure. Programs/software/algorithms used are indicated next to the arrows in grey. File formats for text and FASTA are depicted as shown in legend.


  1. Homolog identification
    1. To perform a BLAST search to the respective database(s), we created a query protein sequence file (in FASTA format), with sequences from (relatively) well-annotated genomes and from a diverse range of species, if present, across multiple kingdoms. A list of various species used along with a link to the sequence data resource is available in Appendix-1.
    2. Using the query sequence file (-query), perform the BLAST search with the tblastn and blastp programs against the transcriptome and proteome databases (-db), respectively. Set the E-value cut-off (-evalue) to 0.01 and save the output (-out) as a tab-delimited text file (-outfmt 6). The remaining parameters are kept at their default settings.

      $ tblastn -query filename.fa -db transcriptome.fasta -out output.blast -evalue 0.01 -outfmt '6 qseqid sseqid slen qstart qend sstart send evalue bitscore score length pident nident positive ppos mismatch gaps frames qcovs qcovhsp sseq'
      $ blastp -query filename.fa -db proteome.fasta -out output.blast -evalue 0.01 -outfmt '6 qseqid sseqid slen qstart qend sstart send evalue bitscore score length pident nident positive ppos mismatch gaps frames qcovs qcovhsp sseq'

    3. The BLAST output contains all the scoring information about the subject (transcript/protein) sequence that has a similarity to the corresponding query sequence. To retrieve the subject sequence identifiers from the BLAST output, we have used the ‘cut’, ‘sort’ and ‘uniq’ functions of a Linux BASH shell (terminal). ‘cut’ takes the BLAST output (output.blast) from the previous step, and takes the second column (-f2), i.e., subject sequence identifiers and sends/pipes them (|) to the ‘sort’ function. After sorting, they are passed on to the ‘uniq’ function to remove the duplicates and the output is written to the file (SubjectIdentifiers.txt).

      $ cut -f2 output.blast | sort | uniq > SubjectIdentifiers.txt
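For readers who prefer to post-process the BLAST table in a script, the same selection can be sketched in Python. This is an illustrative equivalent of the cut/sort/uniq pipeline, not the authors' code; the function name and the example hits are hypothetical. It assumes the column order of the -outfmt string used above (subject identifier in field 2, E-value in field 8) and additionally allows a stricter E-value cut-off, which the shell command does not do.

```python
def unique_subject_ids(blast_lines, max_evalue=0.01):
    """Return the sorted, de-duplicated subject identifiers from tabular BLAST lines."""
    ids = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        # Field 2 is sseqid and field 8 is evalue in the -outfmt string used above.
        if float(fields[7]) <= max_evalue:
            ids.add(fields[1])
    return sorted(ids)

# Hypothetical hits, truncated to the first nine columns for brevity.
example_lines = [
    "query1\tsubjB\t500\t1\t100\t1\t300\t1e-30\t250",
    "query1\tsubjA\t400\t1\t100\t1\t300\t0.005\t40",
    "query2\tsubjB\t500\t5\t90\t10\t280\t1e-12\t120",
    "query2\tsubjC\t300\t5\t90\t10\t280\t0.5\t20",
]
subject_ids = unique_subject_ids(example_lines)
print("\n".join(subject_ids))
```

With a real file, the lines can be supplied by iterating over `open("output.blast")`.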

    4. Use these identifiers (SubjectIdentifiers.txt) to extract the corresponding transcript (SelectedTranscripts.fasta) or protein sequences (SelectedProteins.fasta) from the respective transcriptome or proteome by running the ‘faSomeRecords’ program.

      $ faSomeRecords transcriptome.fasta SubjectIdentifiers.txt SelectedTranscripts.fasta
      $ faSomeRecords proteome.fasta SubjectIdentifiers.txt SelectedProteins.fasta
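Where the faSomeRecords binary is not available, the extraction can be approximated with a few lines of Python. The sketch below is an assumption-laden stand-in rather than a reimplementation of the UCSC tool: it matches identifiers against the first whitespace-delimited word of each FASTA header, and the function names and example records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def select_records(fasta_lines, wanted_ids):
    """Keep records whose first header word is one of the wanted identifiers."""
    wanted = set(wanted_ids)
    selected = []
    for header, sequence in read_fasta(fasta_lines):
        words = header.split()
        if words and words[0] in wanted:
            selected.append((header, sequence))
    return selected

# Hypothetical two-record transcriptome.
example_fasta = """>scaffold-ABCD-2000001 hypothetical transcript
ATGGCTAGCT
AGCTAA
>scaffold-ABCD-2000002
ATGCCC
""".splitlines()
selected = select_records(example_fasta, ["scaffold-ABCD-2000001"])
for header, sequence in selected:
    print(f">{header}\n{sequence}")
```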

    5. Protein sequences are more informative due to the higher number of site patterns and can be used directly for phylogeny construction, whereas transcript sequences should first be translated to protein sequences using the program TransDecoder with default settings. First, determine the longest Open Reading Frames (ORFs of at least 100 amino acids in length) of each transcript with TransDecoder.LongOrfs. Then predict the CDS and the corresponding amino acid sequences of these ORFs with TransDecoder.Predict. If the tree based on protein sequences results in poor bootstrap support, we suggest generating the tree with CDS (DNA) sequences.

      $ perl TransDecoder.LongOrfs -t SelectedTranscripts.fasta
      $ perl TransDecoder.Predict -t SelectedTranscripts.fasta

  2. Ortholog detection
    1. Not all the sequences that have an E-value < 0.01 are true orthologs of a query protein. Hence, additional filters are needed to remove non-orthologs. One such filter is the presence of the same domains in orthologous proteins. For some well-annotated proteins (e.g., Auxin Response Factors, kinases), domain information is readily available in the InterPro domain database. Scan the protein sequences from the previous step (-i SelectedProteins.fasta) for known domains using the InterProScan tool (interproscan.sh), which produces a tab-delimited (TSV) file as well as HTML/XML files (-f TSV,HTML,XML) listing all the domains identified in each protein sequence along with the corresponding InterPro identifiers (-iprlookup). A Python script (InterproscanSummary.py) was developed to process this TSV file and extract the final set of protein sequences that have the domains of interest (see the GitHub page for more details). InterProScan is time-consuming; hence, we used pre-annotated data where available, or reduced the number of databases to scan (using the -appl Pfam,CDD setting) to save time. In some cases, we split the data into smaller batches and ran them on multiple processors.

      $ interproscan.sh -f TSV,HTML,XML -iprlookup -i SelectedProteins.fasta
      $ python InterproscanSummary.py
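The domain filter can be illustrated with a short Python sketch. This is not the authors' InterproscanSummary.py, whose interface is not described here; it is a hypothetical helper that assumes the standard InterProScan TSV layout, in which field 1 is the protein identifier and field 12 is the InterPro accession (written only when -iprlookup is used). The accessions in the example are placeholders chosen purely for illustration.

```python
from collections import defaultdict

def proteins_with_domains(tsv_lines, required_ipr):
    """Return the proteins annotated with every InterPro accession in required_ipr."""
    found = defaultdict(set)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Field 12 (InterPro accession) exists only when -iprlookup was given.
        if len(fields) >= 12 and fields[11].startswith("IPR"):
            found[fields[0]].add(fields[11])
    required = set(required_ipr)
    return sorted(protein for protein, domains in found.items() if required <= domains)

def row(protein, ipr):
    """Build one hypothetical 13-column InterProScan TSV row."""
    return "\t".join([protein, "md5", "500", "Pfam", "PF00000", "description",
                      "10", "100", "1e-20", "T", "01-01-2020", ipr, "InterPro description"])

example_tsv = [
    row("protA", "IPR000001"), row("protA", "IPR000002"),  # both domains present
    row("protB", "IPR000001"),                             # only one domain
    row("protC", "-"),                                     # no InterPro match
]
kept = proteins_with_domains(example_tsv, ["IPR000001", "IPR000002"])
print(kept)
```

Requiring all accessions mirrors the idea of selecting on domain architecture; for families defined by a single domain, a one-element list can be passed.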

    2. Certain proteins (e.g., SOSEKI in Arabidopsis; Yoshida et al., 2019) lack annotated (functional) domain information. Use the MEME program to predict the conserved motifs/domains in those proteins with the Zero or One Occurrence Per Sequence criterion (-mod zoops), a minimum motif width of 10 (-minw) and a maximum of 10 motifs predicted per set (-nmotifs). MEME outputs the motifs along with their patterns in HTML/TEXT format. Then use these motif patterns in the ScanProsite web-tool to identify the domains in the protein sequences that do not have annotated domains. We have applied this approach successfully to annotate the SOSEKI protein family and identify its orthologs (van Dop et al., 2020).

      $ meme ProteinSequences.fa -o OutputName -protein -mod zoops -nmotifs 10 -minw 10
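To give a feel for what a pattern scan does, the sketch below converts a PROSITE-style pattern into a Python regular expression and searches a sequence with it. It is a simplified local illustration, not the ScanProsite implementation: it handles only the basic syntax (x for any residue, [..] for alternatives, {..} for exclusions, (n) or (n,m) for repeats, < and > for sequence ends), and the pattern and sequence are hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)

def scan_sequence(pattern, sequence):
    """Return (start, end, match) tuples with 1-based, inclusive coordinates."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]

# Hypothetical motif pattern and protein fragment.
regex_text = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
matches = scan_sequence("[AC]-x-V-x(4)-{ED}", "MKAAVLLLLGE")
print(regex_text, matches)
```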

    3. After selecting the protein sequences that have the domains of interest, query them back against the proteomes of the species used in Step A1 to confirm the orthologous relationships using the best Bi-directional BLAST Hits (BBH) strategy. Here, we set the maximum number of target sequences, i.e., the number of best hits in the output (-max_target_seqs), to 1, or sometimes 2 when the domains are abundant in the genome (e.g., bHLH), with E-value < 0.01 (-evalue). The final set of proteins that hit the protein under consideration is regarded as the ‘true’ orthologous proteins for further analysis. The output is recorded in a TSV file, as in Step A2 (-outfmt 6).

      $ blastp -query filename.fa -db ArabidopsisProteome.fasta -out BBhits.blastp -max_target_seqs 1 -evalue 0.01 -outfmt '6 qseqid sseqid slen qstart qend sstart send evalue bitscore score length pident nident positive ppos mismatch gaps frames qcovs qcovhsp sseq'
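The reciprocal check can be sketched as follows. This is a hypothetical helper, not taken from the authors' repository: it reads the reverse BLAST table (candidates queried against a reference proteome), ranks the hits of each candidate by bit score (field 9 of the -outfmt string above), and keeps the candidates whose best hit, or best two hits, include a protein of interest. All identifiers in the example are invented.

```python
def best_hits(blast_lines, top_n=1):
    """Map each query to its top_n distinct subjects, ranked by bit score (field 9)."""
    hits = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        hits.setdefault(fields[0], []).append((float(fields[8]), fields[1]))
    best = {}
    for query, scored in hits.items():
        ranked = []
        for _, subject in sorted(scored, key=lambda hit: -hit[0]):
            if subject not in ranked:
                ranked.append(subject)
        best[query] = ranked[:top_n]
    return best

def reciprocal_orthologs(reverse_lines, target_ids, top_n=1):
    """Keep candidates whose top reverse hits include a protein of interest."""
    targets = set(target_ids)
    return sorted(query for query, subjects in best_hits(reverse_lines, top_n).items()
                  if targets & set(subjects))

def hit(query, subject, bitscore):
    """Build one hypothetical BLAST row (first nine columns only)."""
    return "\t".join([query, subject, "500", "1", "100", "1", "100", "1e-30", str(bitscore)])

reverse_hits = [
    hit("cand1", "REF_TARGET", 300), hit("cand1", "REF_OTHER", 200),
    hit("cand2", "REF_OTHER", 250), hit("cand2", "REF_TARGET", 240),
]
strict = reciprocal_orthologs(reverse_hits, ["REF_TARGET"], top_n=1)
relaxed = reciprocal_orthologs(reverse_hits, ["REF_TARGET"], top_n=2)
print(strict, relaxed)
```

Setting `top_n=2` corresponds to the relaxed setting described above for families with abundant domains, at the cost of admitting more false positives that must later be recognised in the tree.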

  3. Phylogeny construction
    1. These ‘true’ sets of orthologs are used for alignment followed by the phylogenetic tree construction. MAFFT is used to align protein sequences. The E-INS-i (--genafpair) algorithm is used while aligning proteins with multiple domains separated by poorly conserved sequences (e.g., ARF or Aux/IAA proteins), whereas G-INS-i (--globalpair) is used while aligning only domain-specific sequences (e.g., PB1 domain). An iterative refinement method is used in both cases, with a maximum of 1000 iterations (--maxiterate 1000), after which the final alignment is written to a FASTA file (output_file).

      $ mafft --genafpair --maxiterate 1000 input_file > output_file
      $ mafft --globalpair --maxiterate 1000 input_file > output_file

    2. Once the alignments are generated, use trimAl to remove the alignment positions (columns) with more than 50%-80% gaps, as they are considered to lack phylogenetic signal. Hence, for phylogenetic tree construction, use only the positions without spurious gaps. A gap threshold of 0.2 (-gt 0.2) removes all positions in the alignment with gaps in 80% (or more) of the sequences. For gene families that have moderately conserved domains (e.g., ARF, Aux/IAA), use a threshold of 0.3 or 0.4, whereas for poorly conserved domains (e.g., PB1) set it at 0.2, and for highly conserved proteins (e.g., ROP, ROPGEF) set it between 0.6 and 0.8. As an additional (optional) check, sequences shorter than one quarter of the average sequence length are removed in JalView.
      Note: There are various tools specialized for the clean-up of the alignment, such as GBlocks, Guidance, AliScore, ZORRO etc. However, a simple gap-based trimming in trimAl resulted in (almost) the same quality of alignment and tree topology when compared to these specialized tools. Hence, we used trimAl for alignment clean-up throughout this study.

      $ trimal -in inputfile.fa -out outputfile.fa -fasta -gt 0.2
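The effect of the gap threshold can be made concrete with a small sketch. It is an illustration of the idea only, not trimAl's code: a column is kept when the fraction of sequences carrying a residue is at least the threshold (the exact treatment of columns sitting precisely on the threshold in trimAl is an assumption here). The second function mimics the optional length check; the function names and toy alignments are hypothetical.

```python
def trim_gappy_columns(alignment, gap_threshold=0.2):
    """Keep columns where at least gap_threshold of the sequences have a residue."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [col for col in range(length)
            if sum(alignment[name][col] != "-" for name in names) / len(names) >= gap_threshold]
    return {name: "".join(alignment[name][col] for col in keep) for name in names}

def drop_short_sequences(alignment, min_fraction=0.25):
    """Remove sequences shorter than min_fraction of the mean ungapped length."""
    lengths = {name: len(seq.replace("-", "")) for name, seq in alignment.items()}
    mean_length = sum(lengths.values()) / len(lengths)
    return {name: seq for name, seq in alignment.items()
            if lengths[name] >= min_fraction * mean_length}

# Toy alignment: the third column has a residue in only one of five sequences.
toy = {"s1": "MK-A", "s2": "MK-A", "s3": "M--A", "s4": "M--A", "s5": "MKLA"}
trimmed = trim_gappy_columns(toy, gap_threshold=0.4)

# Toy alignment with one very short (partial) sequence.
partial = {"a": "MKLAVMKLAV", "b": "MKLAVMKL--", "c": "M---------"}
filtered = drop_short_sequences(partial)
print(trimmed, sorted(filtered))
```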

    3. Then use this ‘clean’ alignment to identify the most appropriate model of evolution for each protein family. ModelFinder and ModelTest-NG are used to predict the best model based on the Akaike and Bayesian Information Criteria (AIC and BIC). For the majority of the protein families, both programs select the same best model. Where the two programs disagree, use a third program (either PartitionFinder or a Perl script from the RAxML distribution) to decide on the best model by majority rule. As expected, different proteins evolve differently, leading to different models of evolution. ModelFinder is run as a part of IQ-TREE; hence, it does not require any additional steps. ModelTest-NG requires the data type of the input (-d, either amino acid or nucleotide) and the input dataset (-i INFILE), and writes the statistics and the best model to the output file (-o OUTFILE). PartitionFinder requires the alignment in PHYLIP format (instead of the FASTA format used by the others), placed in the folder ‘partition_finder_models’, where the output statistics and best model are also recorded. FASTA to PHYLIP conversion can be done with the Perl script fasta2relaxedPhylip.pl, which takes the input FASTA (-f input.fa) and writes the output in PHYLIP format (-o output.phylip).

      $ modeltest-ng -d aa -i INFILE -o OUTFILE
      $ perl RAxML_ProteinModelSelection.pl alignment.fasta
      $ perl fasta2relaxedPhylip.pl -f input.fa -o output.phylip
      $ python PartitionFinderProtein.py partition_finder_models
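The majority-rule decision between programs is simple enough to write down explicitly. The sketch below is hypothetical and assumes that the model names reported by the different programs have already been normalised to a common spelling, which in practice may need a manual step because the programs do not necessarily label the same model identically.

```python
from collections import Counter

def consensus_model(predictions):
    """Return the model chosen by a strict majority of programs, or None if there is none."""
    counts = Counter(predictions.values())
    model, votes = counts.most_common(1)[0]
    return model if votes > len(predictions) / 2 else None

# Hypothetical outcome in which the third program breaks the tie.
chosen = consensus_model({
    "ModelFinder": "LG+G",
    "ModelTest-NG": "JTT+G",
    "PartitionFinder": "LG+G",
})
print(chosen)
```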

    4. Phylogenetic trees are built mainly using IQ-TREE and RAxML, based on the ‘clean’ alignment produced in Step C10 and the evolutionary model predicted in Step C11. For the phylogenetic trees made with IQ-TREE, we used 1,000 ultrafast bootstrap replicates (-bb 1000) and the SH-like approximate Likelihood Ratio Test with 1,000 replicates (-alrt 1000), combined with automatic model finding through ModelFinder (-m MFP+MERGE). For the trees made with RAxML, we used rapid bootstrapping and a Maximum Likelihood search in the same run (-f a), with bootstopping based on the extended majority rule criterion (-# autoMRE). In addition, we gave random seed numbers (-x and -p) to turn on rapid bootstrapping and parsimony inference, whereas -m takes the model from the previous Step C11. For trees with very poor bootstrap support for the majority of the branches, we used another phylogenetic tree construction program, PhyML, with 100 bootstrap replicates (-b 100), empirical amino acid frequencies (-f e), the gamma shape parameter estimated by maximum likelihood (-a e) and a topology search based on sub-tree pruning and re-grafting (-s SPR). After running these programs, the trees obtained were compared to understand the overall topology based on the congruent branches (see next step). We also tried and tested various Bayesian approaches (using MrBayes), but the trees never converged even after months of computation, and provided various incongruent topologies. Hence, all the analyses were performed with Maximum Likelihood approaches.

      $ iqtree -s CleanAlignment.fa -pre OutputName -alrt 1000 -bb 1000 -m MFP+MERGE
      $ raxmlHPC-PTHREADS-AVX2 -f a -x 12345 -p 12345 -j -# autoMRE -m PROTGAMMAJTT -s CleanAlignment.fa -n OutputName
      $ PhyML-3.1_linux64 -i CleanAlignment.fa -d aa -b 100 -m JTT -f e -s SPR -a e
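One way to compare the trees from different programs is to list the branches (bipartitions of the taxa) they share. The sketch below is a simplified, hypothetical illustration rather than the authors' procedure: it parses plain Newick strings without quoted labels, ignores support values and branch lengths, and treats the trees as unrooted by describing every internal branch by the side that does not contain a reference taxon. The toy trees are invented.

```python
import re

def bipartitions(newick):
    """Return the non-trivial splits of a Newick tree as frozensets of leaf names."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, clades, previous = [], [], None
    for token in tokens:
        if not token.strip():
            continue  # ignore whitespace between symbols
        if token == "(":
            stack.append(set())
        elif token == ")":
            clade = stack.pop()
            clades.append(frozenset(clade))
            if stack:
                stack[-1] |= clade
        elif token not in (",", ";") and previous != ")":
            # A label that does not follow ')' is a leaf; drop its branch length.
            stack[-1].add(token.split(":")[0].strip())
        previous = token
    all_leaves = max(clades, key=len)
    reference = min(all_leaves)  # arbitrary but reproducible reference taxon
    splits = set()
    for clade in clades:
        side = all_leaves - clade if reference in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(frozenset(side))
    return splits

def congruent_branches(*trees):
    """Splits that are present in every tree."""
    return set.intersection(*(bipartitions(tree) for tree in trees))

tree_1 = "((A:0.1,B:0.2)95:0.3,(C,D)80,(E,F)70);"
tree_2 = "((A,B),(C,E),(D,F));"
shared = congruent_branches(tree_1, tree_2)
print(shared)
```

In this toy case only the branch separating A and B from the remaining taxa is recovered by both trees; established phylogenetics libraries offer more complete comparisons and should be preferred for real analyses.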

    5. Visualize all the final phylogenetic trees using the iTOL webserver and then overlay various datasets on the phylogenetic trees. Obtain protein domain information from InterProScan or MEME, sequence length from TransDecoder and clade/taxonomy information from the OneKP and MMETSP databases, and add them following the instructions provided in the iTOL documentation.
    6. Once the trees are obtained, check them manually for errors. Manually remove branches showing long branch attraction, partial sequences and any misplaced taxa. If the proportion of these misplaced branches is too high, re-analyze the phylogeny with more sequences from other species, as well as by removing the spurious sequences. Repeat these steps until the trees obtained are not only supported by good bootstrap values but also agree with the taxonomy of those phyla.

    Limitations and Conclusions
    Due to the generalized nature of the method, it was difficult to automate the complete protocol. Hence, wherever possible, the method was simplified with scripts/commands dedicated to fast and parallel processing. On the other hand, this gave control over the decision-making process based on the protein under consideration. When dealing with highly redundant protein families, we removed highly similar proteins (> 90% similarity) prior to phylogeny, which reduced the (computational) time without losing accuracy. In many cases, we observed that the best hit in the reciprocal BLAST is not really a BBH, as sometimes the second hit was the true best one, differing by only one or a few amino acids (especially in proteins with common domains, e.g., bHLH or PB1). Hence, in those cases we considered the two best hits and used both for phylogeny construction. The false positive orthologs were eventually placed in the outgroup (or at least separate from the ingroup) in the phylogenetic tree. As we were dealing with transcriptomes, we could not predict the actual gene copy number in each species, but only the ancestral copy number for that class or phylum, by comparing the ancestral copies across the majority of the species in that phylum. Another issue with (low-depth) transcriptomes was that we found many partial transcripts leading to truncated proteins/domains, and we might have failed to identify transcripts that were not expressed in that particular tissue or condition. In that regard, combining ortholog sequence information from multiple transcriptomes or species of various families is mandatory to confirm the ancestral state for each class or phylum.
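The redundancy reduction mentioned above can be sketched as a greedy filter over an existing alignment. This is a hypothetical illustration and not the procedure the authors used: it takes percent identity over aligned, ungapped positions as a stand-in for similarity, and keeps the longest sequence of each highly similar group. Sequence names and the toy alignment are invented.

```python
def pairwise_identity(seq_a, seq_b):
    """Fraction of identical residues over columns where both sequences have a residue."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def remove_redundant(alignment, max_identity=0.9):
    """Greedily keep sequences that are not more than max_identity identical to a kept one."""
    # Visit the longest (ungapped) sequences first so that they become the representatives.
    order = sorted(alignment, key=lambda name: (-len(alignment[name].replace("-", "")), name))
    kept = []
    for name in order:
        if all(pairwise_identity(alignment[name], alignment[other]) <= max_identity
               for other in kept):
            kept.append(name)
    return kept

toy_alignment = {
    "p1": "MKLAVTGHSQ",
    "p2": "MKLAVTGHS-",  # identical to p1 over the shared positions
    "p3": "MRLCVSGHTQ",  # 60% identical to p1
}
representatives = remove_redundant(toy_alignment)
print(representatives)
```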

    Based on this protocol and the guidelines mentioned above, we have reconstructed the ancestral states of various protein families along with their orthologs in a ‘deep’ phylogenetic space, across multiple kingdoms. We demonstrated how this method was implemented for proteins that are well-defined with known domains, novel proteins with unknown domains, poorly conserved domains and phylum/kingdom-specific proteins that (dis)appeared at various stages in evolution. This approach was successfully applied to the core proteins of the auxin signalling (Nuclear Auxin Pathway (NAP)) and biosynthesis pathways. NAP includes Auxin Response Factor (ARF), Auxin/Indole-3-Acetic-Acid (Aux/IAA) and Transport Inhibitor Response 1/Auxin-signalling F-Box (TIR1/AFB; Mutte et al., 2018). Biosynthesis pathway proteins include the TAA family of amino transferases (TAA) and the YUCCA family of monooxygenases (YUC). It was also applied to an individual domain, Phox and Bem1 (PB1; Mutte and Weijers, 2020), along with various downstream targets of the auxin pathway, such as SOSEKI (SOK; van Dop et al., 2020), Target of MOnopteros 5 (TMO5) and its interaction partner Lonesome HighWay (LHW; Lu et al., 2020). Taken together, following this protocol, in combination with ever-growing high-quality sequence data and rapid developments in phylogenetic methods and algorithms, can reveal new evolutionary insights into proteins and the crucial pathways they act in.

Acknowledgments

The authors would like to thank the 1,000 plant transcriptomes (OneKP) and Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) consortia for providing such massive data resources to the scientific community. We also highly appreciate the efforts of all the authors who developed many extremely useful and efficient programs and algorithms for phylogenetics and made them freely accessible to the scientific community.

Competing interests

The authors declare no conflicts of interest.

References

  1. Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., Clementi, L., Ren, J., Li, W. W. and Noble, W. S. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37(Web Server issue): W202-208.
  2. Burki, F., Roger, A. J., Brown, M. W. and Simpson, A. G. B. (2019). The new tree of Eukaryotes. Trends Ecol Evol 35(1): 43-55.
  3. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K. and Madden, T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10: 421.
  4. Carpenter, E. J., Matasci, N., Ayyampalayam, S., Wu, S., Sun, J., Yu, J., Jimenez Vieira, F. R., Bowler, C., Dorrell, R. G., Gitzendanner, M. A., Li, L., Du, W., K, K. U., Wickett, N. J., Barkmann, T. J., Barker, M. S., Leebens-Mack, J. H. and Wong, G. K. (2019). Access to RNA-sequencing data from 1,173 plant species: The 1,000 Plant transcriptomes initiative (1KP). Gigascience 8(10).
  5. Darriba, D., Posada, D., Kozlov, A. M., Stamatakis, A., Morel, B. and Flouri, T. (2020). ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models. Mol Biol Evol 37(1): 291-294.
  6. Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W. and Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59(3): 307-321.
  7. Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., Couger, M. B., Eccles, D., Li, B., Lieber, M., MacManes, M. D., Ott, M., Orvis, J., Pochet, N., Strozzi, F., Weeks, N., Westerman, R., William, T., Dewey, C. N., Henschel, R., LeDuc, R. D., Friedman, N. and Regev, A. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc 8(8): 1494-1512.
  8. Jones, P., Binns, D., Chang, H. Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A. F., Sangrador-Vegas, A., Scheremetjew, M., Yong, S. Y., Lopez, R. and Hunter, S. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics 30(9): 1236-1240.
  9. Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. and Jermiin, L. S. (2017). ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods 14(6): 587-589.
  10. Katoh, K. and Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4): 772-780.
  11. Keeling, P. J., Burki, F., Wilcox, H. M., Allam, B., Allen, E. E., Amaral-Zettler, L. A., Armbrust, E. V., Archibald, J. M., Bharti, A. K., Bell, C. J., Beszteri, B., Bidle, K. D., Cameron, C. T., Campbell, L., Caron, D. A., Cattolico, R. A., Collier, J. L., Coyne, K., Davy, S. K., Deschamps, P., Dyhrman, S. T., Edvardsen, B., Gates, R. D., Gobler, C. J., Greenwood, S. J., Guida, S. M., Jacobi, J. L., Jakobsen, K. S., James, E. R., Jenkins, B., John, U., Johnson, M. D., Juhl, A. R., Kamp, A., Katz, L. A., Kiene, R., Kudryavtsev, A., Leander, B. S., Lin, S., Lovejoy, C., Lynn, D., Marchetti, A., McManus, G., Nedelcu, A. M., Menden-Deuer, S., Miceli, C., Mock, T., Montresor, M., Moran, M. A., Murray, S., Nadathur, G., Nagai, S., Ngam, P. B., Palenik, B., Pawlowski, J., Petroni, G., Piganeau, G., Posewitz, M. C., Rengefors, K., Romano, G., Rumpho, M. E., Rynearson, T., Schilling, K. B., Schroeder, D. C., Simpson, A. G., Slamovits, C. H., Smith, D. R., Smith, G. J., Smith, S. R., Sosik, H. M., Stief, P., Theriot, E., Twary, S. N., Umale, P. E., Vaulot, D., Wawrik, B., Wheeler, G. L., Wilson, W. H., Xu, Y., Zingone, A. and Worden, A. Z. (2014). The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): illuminating the functional diversity of eukaryotic life in the oceans through transcriptome sequencing. PLoS Biol 12(6): e1001889.
  12. Lanfear, R., Calcott, B., Ho, S. Y. and Guindon, S. (2012). Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol Biol Evol 29(6): 1695-1701.
  13. Letunic, I. and Bork, P. (2019). Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res 47(W1): W256-W259.
  14. Levy, S. E. and Myers, R. M. (2016). Advancements in next-Generation sequencing. Annu Rev Genomics Hum Genet 17: 95-115.
  15. Li, F. W., Villarreal, J. C., Kelly, S., Rothfels, C. J., Melkonian, M., Frangedakis, E., Ruhsam, M., Sigel, E. M., Der, J. P., Pittermann, J., Burge, D. O., Pokorny, L., Larsson, A., Chen, T., Weststrand, S., Thomas, P., Carpenter, E., Zhang, Y., Tian, Z., Chen, L., Yan, Z., Zhu, Y., Sun, X., Wang, J., Stevenson, D. W., Crandall-Stotler, B. J., Shaw, A. J., Deyholos, M. K., Soltis, D. E., Graham, S. W., Windham, M. D., Langdale, J. A., Wong, G. K., Mathews, S. and Pryer, K. M. (2014). Horizontal transfer of an adaptive chimeric photoreceptor from bryophytes to ferns. Proc Natl Acad Sci U S A 111(18): 6672-6677.
  16. Lu, K. J., van 't Wout Hofland, N., Mor, E., Mutte, S., Abrahams, P., Kato, H., Vandepoele, K., Weijers, D. and De Rybel, B. (2020). Evolution of vascular plants through redeployment of ancient developmental regulators. Proc Natl Acad Sci U S A 117(1): 733-740. 
  17. Mutte, S. K., Kato, H., Rothfels, C., Melkonian, M., Wong, G. K. and Weijers, D. (2018). Origin and evolution of the nuclear auxin response system. Elife 7: e33399.
  18. Mutte, S. K. and Weijers, D. (2020). Deep Evolutionary History of the Phox and Bem1 (PB1) Domain Across Eukaryotes. Sci Rep 10: 3797.
  19. Nguyen, L. T., Schmidt, H. A., von Haeseler, A. and Minh, B. Q. (2015). IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32(1): 268-274.
  20. One Thousand Plant Transcriptomes Initiative (2019). One thousand plant transcriptomes and the phylogenomics of green plants. Nature 574(7780): 679-685.
  21. Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D. L., Darling, A., Höhna, S., Larget, B., Liu, L., Suchard, M. A. and Huelsenbeck, J. P. (2012). MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 61(3): 539-542.
  22. Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9): 1312-1313. 
  23. Stark, R., Grzelak, M. and Hadfield, J. (2019). RNA sequencing: the teenage years. Nat Rev Genet 20(11): 631-656.
  24. van Dop, M., Fiedler, M., Mutte, S., de Keijzer, J., Olijslager, L., Albrecht, C., Liao, C-Y., Janson, M., Bienz, M., and Weijers, D. (2020). A conserved biochemical paradigm underlies cell polarity across multicellular kingdoms. Cell (in press).
  25. Wang, B., Tseng, E., Regulski, M., Clark, T. A., Hon, T., Jiao, Y., Lu, Z., Olson, A., Stein, J. C. and Ware, D. (2016). Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat Commun 7: 11708.
  26. Waterhouse, A. M., Procter, J. B., Martin, D. M., Clamp, M. and Barton, G. J. (2009). Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25(9): 1189-1191.
  27. Wickett, N. J., Mirarab, S., Nguyen, N., Warnow, T., Carpenter, E., Matasci, N., Ayyampalayam, S., Barker, M. S., Burleigh, J. G., Gitzendanner, M. A., Ruhfel, B. R., Wafula, E., Der, J. P., Graham, S. W., Mathews, S., Melkonian, M., Soltis, D. E., Soltis, P. S., Miles, N. W., Rothfels, C. J., Pokorny, L., Shaw, A. J., DeGironimo, L., Stevenson, D. W., Surek, B., Villarreal, J. C., Roure, B., Philippe, H., dePamphilis, C. W., Chen, T., Deyholos, M. K., Baucom, R. S., Kutchan, T. M., Augustin, M. M., Wang, J., Zhang, Y., Tian, Z., Yan, Z., Wu, X., Sun, X., Wong, G. K. and Leebens-Mack, J. (2014). Phylotranscriptomic analysis of the origin and early diversification of land plants. Proc Natl Acad Sci U S A 111(45): E4859-4868.
  28. Yerramsetty, P., Stata, M., Siford, R., Sage, T. L., Sage, R. F., Wong, G. K., Albert, V. A. and Berry, J. O. (2016). Evolution of RLSB, a nuclear-encoded S1 domain RNA binding protein associated with post-transcriptional regulation of plastid-encoded rbcL mRNA in vascular plants. BMC Evol Biol 16(1): 141.
  29. Yoshida, S., van der Schuren, A., van Dop, M., van Galen, L., Saiga, S., Adibi, M., Moller, B., Ten Hove, C. A., Marhavy, P., Smith, R., Friml, J. and Weijers, D. (2019). A SOSEKI-based coordinate system interprets global polarity cues in Arabidopsis. Nat Plants 5(2): 160-166.

Introduction

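[Abstract] Phylogenetics is an important area of evolutionary biology that helps to understand the origin and divergence of genes, genomes and species. Building meaningful phylogenetic trees is needed for the accurate reconstruction of the past. To achieve a correct phylogenetic understanding of genes or proteins, reliable and robust methods are needed to construct meaningful trees. With the rapidly increasing availability of genome and transcriptome sequencing data, there is a need for efficient and accurate methodologies for ancestral state reconstruction. Currently available methods are mostly specific for certain gene families, and require substantial adaptation for their application to other gene families. Hence, a generalized framework is essential to utilize large transcriptome resources such as OneKP and MMETSP. Here, we have developed a flexible yet efficient method, based on core strengths such as emphasis on being inclusive in homolog selection, and defining orthologs based on multi-layered inferences. We illustrate how specific steps can be modified to fit the needs of any protein family under consideration. We also demonstrate the success of this protocol by studying and testing the orthologs in various gene families. Taken together, we present a protocol for reconstructing the ancestral states of various domains and proteins across multiple kingdoms of eukaryotes, using thousands of transcriptomes.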

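[Background] Phylogenetic trees are fundamental to understanding the evolution of genes, gene families, species, phyla and even kingdoms. They help to depict the diversity and also resolve the differences at various levels. For example, at the protein level, they help us to identify orthologous groups based on amino acid differences across various species. Earlier, phylogenetic trees were constructed from a few gene/protein sequences from a limited number of species. With the ever-growing sequencing data, as more and more genomes and transcriptomes become accessible, there is tremendous potential for, e.g., discovery of new lineages, 'gap-filling' in phylogenies and, hence, an improved understanding of biology (Levy and Myers, 2016; Burki et al., 2019).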

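In the past decade, many efforts have been made to define the transcriptomes of hundreds (or even thousands) of species, owing to the popularity of transcriptome sequencing (Stark et al., 2019). Transcriptomes provide quick insight into the (expressed) gene content of a genome. Even though individual transcriptomes do not cover the entire gene content of an organism, combining them from multiple cells, tissues and conditions can cover the majority of the transcribed genes of that species. It is therefore relatively straightforward to sequence and assemble a transcriptome without a priori knowledge of the genome. Current long-read and single-cell RNA sequencing technologies make it even easier to build a complete transcriptome (Wang et al., 2016). Taking advantage of these technological advancements, two large transcriptome sequencing projects were developed: the 1000 Plant Transcriptomes initiative (OneKP; Carpenter et al., 2019; One Thousand Plant Transcriptomes Initiative, 2019) and the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP; Keeling et al., 2014). OneKP represents the majority of land plant and algal groups, whereas MMETSP covers the majority of the SAR group and other (unidentified) phyla in the Chromista kingdom.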

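Since their inception, different methodologies have been developed and applied to these transcriptomes to estimate the ancestral states of various genes across multiple classes, families and even phyla (Li et al., 2014; Wickett et al., 2014; Yerramsetty et al., 2016). Most of these methods focus on one gene family and need substantial modification before they can be applied to other gene families. Moreover, the methods used are neither inclusive nor robust in terms of multi-layered inferences. The homology inference is based on only one line of evidence: best bidirectional hits, protein domains, or simple phylogenies based on a few genomes. To overcome these shortcomings, we developed a unified framework to construct high-resolution phylogenies that makes use of the rich OneKP and MMETSP transcriptome resources. This new methodology is not only inclusive, but also employs multi-layered orthology inference to interpret the phylogeny with high confidence, leading to the identification of new (sub-)classes of orthologous genes.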



Overview of the protocol

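The current protocol was developed to reconstruct the ancestral states and high-resolution phylogenetic trees of various gene families using transcriptomes and/or proteomes. The ancestral state represents the minimal gene complement at each evolutionary node, from which species-specific gene duplications and/or losses will have modified the gene complement in individual species. Hence, selecting the right sequences, orthologous as well as diverse, is a critical step in such deep phylogenetic tree construction. This protocol is built on three core strengths: (1) Inclusive: include more sequences at the start with liberal parameters, and remove sequences as one goes through the various steps of the pipeline, resulting in a high-quality orthologous sequence set for phylogenetic tree construction. (2) Multi-layered: multi-layered orthology confirmation, i.e., based on the domain architecture, reciprocal BLAST and the phylogenetic tree. (3) Robust: no restrictions on the length of the proteins or the number of sequences used in the phylogeny, with suggestions on alternative software packages tested at the various steps. Overall, the protocol consists of 14 steps divided into three parts: homolog identification (Steps 1-5), ortholog detection (Steps 6-8) and phylogeny construction (Steps 9-14). All the general parameters and recommendations for each step are indicated below.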

Keywords: Phylogenomics, OneKP, MMETSP, Plants, Phylogenetics, Evolution, Transcriptome

Equipment


 


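Linux machine
Computational setup: The majority of the programs mentioned in the Software section run only in a Linux environment; hence, it is recommended to perform the analysis on a Linux machine with access to a BASH shell (terminal). On a general-purpose Linux workstation with 64 GB RAM and an 8-core processor, the average time required to analyze a gene family is 1-1.5 days. The disk space required for this analysis is less than 1 GB.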


 


Software


 


tblastn and blastp from the BLAST+ suite v2.9.0 (Camacho et al., 2009) (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
faSomeRecords: Linux binary from UCSC (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/)
TransDecoder v5.5.0 (Haas et al., 2013) (transdecoder.github.io)
MEME motif discovery v5.1.0 (Bailey et al., 2009) (http://meme-suite.org/)
ScanProsite web tool (https://prosite.expasy.org/scanprosite)
InterProScan v5.38-76.0 (Jones et al., 2014) (https://github.com/ebi-pf-team/interproscan)
MAFFT v7 (Katoh and Standley, 2013) (https://mafft.cbrc.jp/alignment/software/)
JalView (Waterhouse et al., 2009) (https://www.jalview.org/)
ModelFinder (Kalyaanamoorthy et al., 2017) (accessed as a built-in module of IQ-TREE)
ModelTest-NG (Darriba et al., 2020) (https://github.com/ddarriba/modeltest)
PartitionFinder v2 (Lanfear et al., 2012) (http://www.robertlanfear.com/partitionfinder/)
IQ-TREE v1.6.12 (Nguyen et al., 2015) (http://www.iqtree.org)
RAxML v8 (Stamatakis, 2014) (https://cme.h-its.org/exelixis/web/software/raxml/index.html)
PhyML v3.3 (Guindon et al., 2010) (https://github.com/stephaneguindon/phyml)
MrBayes v3.2.7 (Ronquist et al., 2012) (https://github.com/NBISweden/MrBayes)
iTOL v4 (Letunic and Bork, 2019) (https://itol.embl.de)
'cut', 'sort' and 'uniq' functions of the Linux BASH shell (terminal) (https://tiswww.case.edu/php/chet/bash/bashref.html)
Scripts for automating some steps of the protocol are available through GitHub (https://github.com/sumanthmutte/Phylogenomics)
 


Data


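OneKP dataset (1000 Plant Transcriptomes project): contains 1,341 transcriptomes from 1,179 species, covering all the major classes of land plants, green algae, red algae and glaucophytes (Carpenter et al., 2019; One Thousand Plant Transcriptomes Initiative, 2019); http://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/oneKP_capstone_2019
MMETSP dataset (Marine Microbial Eukaryote Transcriptome Sequencing Project): contains 678 transcriptomes from 410 species, covering all the major classes of Stramenopila and Alveolata as well as many unclassified (unicellular) marine eukaryotes (Keeling et al., 2014); https://gold.jgi.doe.gov/study?id=Gs0128947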
 


Procedure


 


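The commands used, along with the parameters for each step of the protocol, are given below, with step numbers corresponding to Figure 1. Before starting the protocol, we first create a BLAST database for each transcriptome or proteome. This is done only once for each transcriptome or proteome using the makeblastdb function, where '-in' takes a FASTA file of the transcriptome or proteome and '-dbtype' is the database type, with 'nucl' and 'prot' for transcriptomes and proteomes, respectively.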


 


              $ makeblastdb -dbtype nucl -in transcriptome.fasta


              $ makeblastdb -dbtype prot -in proteome.fasta


 


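Figure 1. Schematic of the methodology, showing the various steps of the protocol for ortholog identification and phylogenetic tree construction. The circled numbers correspond to the individual steps of the protocol as indicated in the Procedure. The programs/software/algorithms used are indicated in gray next to the arrows. Text and FASTA file formats are depicted as shown in the legend.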


 


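A. Homolog identification
1. To perform a BLAST search against the respective database(s), create a query protein sequence file (in FASTA format) with sequences from (relatively) well-annotated genomes and from a diverse range of species, if present, spanning multiple kingdoms. A list of the species used, along with links to the sequence data resources, is available in Appendix 1.
2. Using the query sequence file (-query), perform a BLAST search with the tblastn and blastp modules against the transcriptome and proteome databases (-db), respectively. With the E-value cut-off (-evalue) set to 0.01, save the output (-out) in a tab-delimited text file, as specified by -outfmt 6. The rest of the parameters are kept at their default settings.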
              $ tblastn -query filename.fa -db transcriptome.fasta -out output.blast -evalue 0.01 -outfmt '6 qseqid sseqid slen qstart qend sstart send evalue bitscore score length pident nident positive ppos mismatch gaps frames qcovs qcovhsp sseq'


              $ blastp -query filename.fa -db proteome.fasta -out output.blast -evalue 0.01 -outfmt '6 qseqid sseqid slen qstart qend sstart send evalue bitscore score length pident nident positive ppos mismatch gaps frames qcovs qcovhsp sseq'


 


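3. The BLAST output contains all the scoring information about the subject (transcript/protein) sequences that show similarity to the corresponding query sequence. To retrieve the subject sequence identifiers from the BLAST output, we use the 'cut', 'sort' and 'uniq' functions of a Linux BASH shell (terminal). 'cut' takes the BLAST output (output.blast) from the previous step, extracts the second column (-f2), i.e., the subject sequence identifiers, and pipes them (|) to the 'sort' function. After sorting, they are passed on to the 'uniq' function to remove duplicates, and the output is written to a file (SubjectIdentifiers.txt).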
 


              $ cut -f2 output.blast | sort | uniq > SubjectIdentifiers.txt
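For readers who prefer to do this step inside a script, the same column extraction can be sketched in Python. This is a minimal illustration and not part of the published scripts: the function name `unique_subject_ids` and the three-row example are assumptions, and only the first two tab-separated columns of the `-outfmt 6` table are relied on.

```python
import io


def unique_subject_ids(blast_lines):
    """Return the sorted, de-duplicated subject IDs (column 2) of tabular BLAST rows."""
    ids = set()
    for line in blast_lines:
        line = line.rstrip("\n")
        # Skip empty lines and comment lines (the latter occur with -outfmt 7).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 2:
            ids.add(fields[1])
    return sorted(ids)


# Hypothetical rows; real rows carry every column requested with -outfmt.
example = "q1\ttx_b\t900\nq1\ttx_a\t850\nq2\ttx_b\t900\n"
for identifier in unique_subject_ids(io.StringIO(example)):
    print(identifier)
```

Writing the returned identifiers to SubjectIdentifiers.txt, one per line, gives the same content as the shell pipeline, although the ordering can differ from `sort` depending on the shell locale.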


 


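4. Using these identifiers (SubjectIdentifiers.txt), extract the corresponding transcript (SelectedTranscripts.fasta) or protein (SelectedProteins.fasta) sequences from the respective transcriptome or proteome by running the 'faSomeRecords' program.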
 


              $ faSomeRecords transcriptome.fasta SubjectIdentifiers.txt SelectedTranscripts.fasta


              $ faSomeRecords proteome.fasta SubjectIdentifiers.txt SelectedProteins.fasta
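faSomeRecords is the tool used in this protocol. Where the UCSC binary is not available, the selection it performs can be approximated as below. This sketch assumes that records are matched on the first whitespace-delimited word of each FASTA header; the names `parse_fasta` and `select_records` are hypothetical and the example records are made up.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_records(fasta_lines, wanted_ids):
    """Keep records whose first header word is in wanted_ids (assumed matching rule)."""
    wanted = set(wanted_ids)
    selected = []
    for header, sequence in parse_fasta(fasta_lines):
        words = header.split()
        if words and words[0] in wanted:
            selected.append((header, sequence))
    return selected


# Made-up records standing in for transcriptome.fasta.
fasta = [">tx_a sample1", "ATGGCC", "TTAA", ">tx_b", "ATGCCC", ">tx_c", "ATGTTT"]
for header, sequence in select_records(fasta, ["tx_a", "tx_c"]):
    print(">" + header)
    print(sequence)
```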


 


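5. Protein sequences are more informative, owing to the higher number of possible states per site, and can be used directly for phylogeny construction. Transcript sequences, however, should be translated to protein sequences using the program TransDecoder with default settings. First, the longest open reading frames (ORFs at least 100 amino acids long) of the transcripts are identified with TransDecoder.LongOrfs. Then the CDS and the corresponding amino acid sequences of these ORFs are predicted through TransDecoder.Predict. If the tree based on protein sequences shows poor bootstrap support, we recommend using the CDS (DNA) sequences to generate the tree.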
 


              $ perl TransDecoder.LongOrfs -t SelectedTranscripts.fasta


              $ perl TransDecoder.Predict -t SelectedTranscripts.fasta


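B. Ortholog detection
6. Not all sequences with an E-value < 0.01 are true orthologs of the query protein. Hence, additional filters are needed to remove non-orthologs. One such filter is the presence of the same domain(s) in the orthologous proteins. For some well-annotated proteins (e.g., Auxin Response Factors, kinases, etc.), domain information is readily available in the InterPro domain database. Scan the protein sequences from the previous step (-i SelectedProteins.fasta) for the presence of known domains using the InterProScan tool (interproscan.sh), which generates a tab-separated (TSV) file along with HTML/XML files (-f TSV,HTML,XML) and identifies all the domains in each protein sequence together with the corresponding InterPro identifiers (-iprlookup). A Python script (InterproscanSummary.py) was developed to process this TSV file and extract the final set of protein sequences with the domain of interest (see the GitHub page for more information). InterProScan is a time-consuming step; to save time, we used pre-annotated data when available, or reduced the number of databases scanned (using the -appl setting with Pfam and CDD). In some cases, we split the data into smaller batches and ran them on multiple processors.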
 


              $ interproscan.sh -f TSV,HTML,XML -iprlookup -i SelectedProteins.fasta


              $ python InterproscanSummary.py
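The InterproscanSummary.py script itself is available from the GitHub page and is not reproduced here. As a rough sketch of the kind of filtering involved, the following assumes the standard InterProScan TSV layout, in which the protein identifier is column 1 and, when -iprlookup is used, the InterPro accession is column 12. The function name, the accession IPR000001 (used purely as a placeholder) and the example rows are assumptions made for illustration.

```python
def proteins_with_domain(tsv_lines, interpro_id):
    """Return sorted protein IDs that have at least one match to interpro_id."""
    hits = set()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Rows without an InterPro annotation may be shorter or carry '-' here.
        if len(fields) >= 12 and fields[11] == interpro_id:
            hits.add(fields[0])
    return sorted(hits)


def example_row(protein, interpro_id):
    """Build a made-up 13-column row in the assumed InterProScan TSV layout."""
    return "\t".join([
        protein, "md5sum", "300", "Pfam", "PF00000", "signature description",
        "10", "90", "1e-5", "T", "01-01-2020", interpro_id, "entry description",
    ])


rows = [
    example_row("prot1", "IPR000001"),
    example_row("prot2", "-"),
    example_row("prot3", "IPR000001"),
]
print(proteins_with_domain(rows, "IPR000001"))
```

The resulting identifiers can then be passed to faSomeRecords to build the FASTA file used in the following steps.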


 


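7. Some proteins (e.g., SOSEKI in Arabidopsis; Yoshida et al., 2019) lack annotated (functional) domain information. Use the MEME program to predict conserved motifs/domains in those proteins, with the criterion of zero or one occurrence per sequence (-mod zoops) and a minimum width of 10 (-minw), predicting a maximum of 10 motifs per set (-nmotifs). MEME outputs the motifs along with their patterns in HTML/TEXT format. Then, use the patterns of these motifs in the ScanProsite web tool to identify the domains in protein sequences that have no annotated domains at all. We have successfully applied this approach to annotate the SOSEKI protein family and identify its orthologs (van Dop et al., 2020).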
 


              $ meme ProteinSequences.fa -o OutputName -protein -mod zoops -nmotifs 10 -minw 10


 


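8. After selecting the protein sequences with the domain of interest, query them back against the proteomes of the species used in Step A1 to confirm the orthologous relationship using the best bidirectional BLAST hit (BBH) strategy. Here, we set the maximum number of target sequences, i.e., best hits, in the output (-max_target_seqs) to 1 (or sometimes 2 when the domain is abundant in the genome, e.g., bHLH), with an E-value < 0.01 (-evalue). The final set of proteins that hit the protein under consideration is regarded as the 'true' orthologous proteins for further analysis. As in Step A2 (-outfmt 6), the output is recorded in a TSV file.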
 


              $ blastp -query filename.fa -db ArabidopsisProteome.fasta -out BBhits.blastp -max_target_seqs 1 -evalue 0.01 -outfmt '6 qseqid sseqid slen qstart qend sstart send evalue bitscore score length pident nident positive ppos mismatch gaps frames qcovs qcovhsp sseq'
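How the reciprocal BLAST table is turned into a list of confirmed candidates is not spelled out as code in this protocol, so the following is only one possible sketch. It assumes the column order given in the -outfmt string above (query in column 1, subject in column 2, bit score in column 9) and keeps a candidate when one of its `top_n` highest-scoring hits is a protein from the original query set. The function name, the identifiers and the scores are hypothetical.

```python
from collections import defaultdict


def confirmed_orthologs(blast_lines, reference_ids, top_n=1, bitscore_col=8):
    """Return candidates whose top_n best hits include a reference protein."""
    hits = defaultdict(list)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= bitscore_col:
            continue
        hits[fields[0]].append((float(fields[bitscore_col]), fields[1]))
    reference = set(reference_ids)
    kept = []
    for query, scored in hits.items():
        # Highest bit score first; only the top_n hits are inspected.
        best = sorted(scored, key=lambda item: item[0], reverse=True)[:top_n]
        if any(subject in reference for _, subject in best):
            kept.append(query)
    return sorted(kept)


def example_row(query, subject, bitscore):
    """Made-up row with the first nine columns of the -outfmt string."""
    return "\t".join([query, subject, "500", "1", "100", "1", "100", "1e-20", str(bitscore)])


rows = [
    example_row("cand1", "ref_target", 200),
    example_row("cand1", "ref_other", 150),
    example_row("cand2", "ref_other", 300),
    example_row("cand2", "ref_target", 100),
]
print(confirmed_orthologs(rows, {"ref_target"}, top_n=1))
print(confirmed_orthologs(rows, {"ref_target"}, top_n=2))
```

Setting `top_n=2` mirrors the relaxed setting described for families with abundant domains, at the cost of admitting more false positives that the later tree-based check has to remove.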


 


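C. Phylogeny construction
9. These sets of 'true' orthologs are used for alignment followed by phylogenetic tree construction. MAFFT is used to align the protein sequences. The E-INS-i (--genafpair) algorithm is used when aligning proteins with multiple domains separated by poorly conserved sequences (e.g., ARF or Aux/IAA proteins), whereas G-INS-i (--globalpair) is used when aligning only domain-specific sequences (e.g., the PB1 domain). The iterative refinement method is used in both cases, with a maximum of 1,000 iterations (--maxiterate 1000), after which the final alignment is written to a FASTA file (OUTPUT_FILE).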
 


              $ mafft --genafpair --maxiterate 1000 INPUT_FILE > OUTPUT_FILE


              $ mafft --globalpair --maxiterate 1000 INPUT_FILE > OUTPUT_FILE


 


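10. Once the alignment is generated, use trimAl to remove sequence positions (columns) with more than 50%-80% gaps, as they are considered to lack phylogenetic signal. Thus, only sequences without spurious gaps are used for phylogenetic tree construction. A gap threshold of 0.2 (-gt 0.2) is set to remove all positions in the alignment with gaps in 80% (or more) of the sequences. For gene families with moderately conserved domains (e.g., ARF, Aux/IAA), use a threshold of 0.3 or 0.4, whereas for poorly conserved domains (e.g., PB1) it is set at 0.2, and for highly conserved proteins (e.g., ROP, ROPGEF) it is set between 0.6 and 0.8. As an additional (optional) check, sequences shorter than 1/4 of the average sequence length are further removed in JalView.
Note: There are various tools dedicated to cleaning up alignments, such as Gblocks, GUIDANCE, AliScore, ZORRO, etc. However, simple gap-based trimming in trimAl resulted in (almost) the same quality of alignment and tree topology when compared with these specialized tools. Hence, we employed trimAl for alignment clean-up throughout this study.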


 


              $ trimal -in inputfile.fa -out outputfile.fa -fasta -gt 0.2
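To make the meaning of the gap threshold concrete, the sketch below applies the same idea to a toy alignment. It is not trimAl's implementation: it assumes that a column is kept when the fraction of sequences without a gap is at least the threshold (how trimAl treats a column sitting exactly on the threshold is not verified here), and it adds the optional length check described above. The function names and the toy sequences are hypothetical.

```python
def trim_gappy_columns(aligned, gap_threshold=0.2, gap_char="-"):
    """Keep columns where the non-gap fraction is >= gap_threshold.

    aligned maps sequence names to aligned sequences of equal length.
    """
    if not aligned:
        return {}
    sequences = list(aligned.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    for col in range(length):
        non_gap = sum(1 for seq in sequences if seq[col] != gap_char)
        if non_gap / len(sequences) >= gap_threshold:
            keep.append(col)
    return {name: "".join(seq[c] for c in keep) for name, seq in aligned.items()}


def drop_short_sequences(aligned, min_fraction=0.25, gap_char="-"):
    """Remove sequences whose ungapped length is below min_fraction of the mean."""
    if not aligned:
        return {}
    lengths = {name: len(seq.replace(gap_char, "")) for name, seq in aligned.items()}
    mean_length = sum(lengths.values()) / len(lengths)
    return {n: s for n, s in aligned.items() if lengths[n] >= min_fraction * mean_length}


toy = {"s1": "MK-L", "s2": "MK--", "s3": "M---", "s4": "M--V", "s5": "M---"}
print(trim_gappy_columns(toy, gap_threshold=0.2))
print(trim_gappy_columns(toy, gap_threshold=0.5))
```

In the toy alignment the third column contains only gaps and is removed at either threshold, while the second and fourth columns (40% occupied) survive at 0.2 but not at 0.5, which is the behaviour the text describes for poorly versus highly conserved families.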


 


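11. We then use this 'clean' alignment to determine the best-fitting model of evolution for each protein family. ModelFinder and ModelTest-NG are used to predict the best model based on the Akaike and Bayesian information criteria (AIC and BIC). For the majority of the protein families, both programs return the same model as the best model. In cases where the two programs disagree, use a third program (either PartitionFinder or a Perl script from the RAxML distribution) to decide on the best model by majority rule. As expected, various proteins evolve differently, leading to different models of evolution. ModelFinder is run as part of IQ-TREE and thus does not require any additional step. ModelTest-NG needs the type (-d; either amino acid or nucleotide) of the input dataset (-i INFILE) and writes the statistics and the best model to the output file (-o OUTFILE). PartitionFinder requires the alignment in PHYLIP format (instead of FASTA format as in the other programs), placed in the folder 'partition_finder_models', where the output statistics and the best model are also recorded. FASTA-to-PHYLIP format conversion can be done with the Perl script (fasta2relaxedPhylip.pl), which takes the input FASTA (-f input.fa) and writes the output in PHYLIP format (-o output.phylip).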
 


              $ modeltest-ng -d aa -i INFILE -o OUTFILE


              $ perl RAxML_ProteinModelSelection.pl alignment.fasta


              $ perl fasta2relaxedPhylip.pl -f input.fa -o output.phylip


              $ python PartitionFinderProtein.py partition_finder_models
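The majority-rule decision between the model-selection programs is a manual step in this protocol. A small sketch of that decision is given below, assuming the best model reported by each program has already been read from its output and written in a common notation (the programs do not necessarily name the same model identically). The function name and the model strings are illustrative assumptions.

```python
from collections import Counter


def consensus_model(predictions):
    """Return the model named by more than half of the programs, else None.

    predictions maps a program name to the best model it reported.
    """
    if not predictions:
        return None
    counts = Counter(predictions.values())
    model, votes = counts.most_common(1)[0]
    return model if votes > len(predictions) / 2 else None


# Hypothetical results for one protein family.
reported = {
    "ModelFinder": "JTT+G",
    "ModelTest-NG": "LG+G",
    "PartitionFinder": "LG+G",
}
print(consensus_model(reported))
```

When every program reports a different model the function returns None, which signals that the choice needs to be made by inspecting the AIC/BIC statistics directly.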


 


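12. Phylogenetic trees are built mainly using IQ-TREE and RAxML, based on the 'clean' alignment generated in Step C10 and the evolutionary model predicted in Step C11. For the phylogenetic trees made with IQ-TREE, we used 1,000 ultrafast bootstrap replicates (-bb 1000) and the SH-like approximate likelihood ratio test (-alrt 1000), combined with automatic model finding through ModelFinder (-m MFP+MERGE). For the trees made with RAxML, we also used rapid bootstrapping and a maximum likelihood search in the same run (-f a), but with a bootstopping criterion based on the extended majority rule (-# autoMRE). In addition, we gave a random seed number (-x and -p) to turn on rapid bootstrapping and parsimony inference, while -m takes the model from the previous Step C11. For trees with very poor bootstrap support for most of the branches, we used another phylogenetic tree construction program, PhyML, with 100 bootstrap replicates (-b 100), empirical amino acid frequencies (-f e), the gamma shape parameter estimated by maximum likelihood (-a e) and the topology searched with the subtree pruning and regrafting method (-s SPR). After running these multiple programs, the trees were compared to understand the overall topology based on consistent branches (see next step). We also tried and tested various Bayesian methods (using MrBayes), but the trees never converged, even after several months of computation, and gave various inconsistent topologies. Hence, all the analyses were performed with maximum likelihood methods.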
 


              $ iqtree -s CleanAlignment.fa -pre OutputName -alrt 1000 -bb 1000 -m MFP+MERGE


              $ raxmlHPC-PTHREADS-AVX2 -f a -x 12345 -p 12345 -j -# autoMRE -m PROTGAMMAJTT -s CleanAlignment.fa -n OutputName


              $ PhyML-3.1_linux64 -i CleanAlignment.fa -d aa -b 100 -m JTT -f e -s SPR -a e


 


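13. Visualize all the final phylogenetic trees using the iTOL web server, and then view the various datasets on the phylogenetic tree. Generate the protein domain information from InterProScan or MEME, the sequence lengths from TransDecoder, and the clade/taxonomy information from the OneKP and MMETSP databases, following the instructions provided in the iTOL documentation.
14. Once the trees are obtained, check them manually for errors. Manually remove branches showing long-branch attraction, partial sequences or any misplaced taxa. If the proportion of these misplaced branches is too high, re-analyze the phylogeny with more sequences from other species, as well as after removing the spurious sequences. Repeat these steps until a better phylogenetic tree is obtained that is not only well supported by bootstraps, but also follows the taxonomy of these phyla.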
 


Limitations and conclusions


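Due to the generalized nature of the methodology, it is difficult to automate the complete protocol. Hence, wherever possible, the method has been simplified with scripts/commands dedicated to fast and parallel processing. On the other hand, this gives control over the decision-making process based on the protein under consideration. When dealing with highly redundant protein families, we removed highly similar proteins (> 90% similarity) prior to the phylogeny, which reduced the (computational) time without loss of accuracy. In many cases, we observed that the best hit in the reciprocal BLAST is not really a BBH, as sometimes the second hit is still the best one, owing to one or a few amino acid difference(s) (especially in proteins with common domains, e.g., bHLH or PB1). Hence, in these cases we considered the two best hits and used both for phylogeny construction. The false-positive orthologs will eventually be placed in the outgroup (or at least separated from the ingroup) in the phylogenetic tree. Since we are dealing with transcriptomes, we may not be able to predict the actual gene copy number in each species, but only the ancestral copy number for that class or phylum, by comparing the ancestral copies across most of the species in that phylum. Another problem in dealing with (low-depth) transcriptomes is that we found many partial transcripts leading to truncated proteins/domains, or we may fail to identify transcripts that are not expressed in that particular tissue or condition. In this regard, combining orthologous gene sequence information from multiple transcriptomes or species of the various families is a must to confirm the ancestral state for each class or phylum.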


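  Based on this protocol and the above guidelines, we reconstructed the ancestral states of various protein families and their orthologs across multiple kingdoms. We demonstrated how the methodology can be implemented for proteins with well-defined known domains, novel proteins with undefined domains, poorly conserved domains, and phylum/kingdom-specific proteins that (dis)appeared at various stages of evolution. The methodology has been applied successfully to the core proteins of auxin signaling (the nuclear auxin pathway, NAP) and of the biosynthesis pathway. The NAP includes Auxin Response Factors (ARF), Auxin/Indole-3-Acetic Acid (Aux/IAA) and Transport Inhibitor Response 1/Auxin Signaling F-Box (TIR1/AFB; Mutte et al., 2018). The biosynthesis pathway proteins include the TAA family of aminotransferases (TAA) and the YUCCA family of monooxygenases (YUC). It has also been applied to the single domain Phox and Bem1 (PB1; Mutte and Weijers, 2020), and to various downstream targets of the auxin pathway, such as SOSEKI (SOK; van Dop et al., 2020), Target of MOnopteros 5 (TMO5) and its interaction partner LOnesome HighWay (LHW; Lu et al., 2020). Taken together, following this protocol, in combination with the ever-growing high-quality sequence data and the rapid advancement of phylogenetic methodologies and algorithms, reveals new evolutionary insights into proteins and key pathways.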


 


Acknowledgments


 


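The authors would like to thank the 1,000 Plant Transcriptomes (OneKP) and Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) consortia for providing such an enormous data resource to the scientific community. The efforts of all the authors who developed the many very useful and efficient programs and algorithms for phylogenetics, and made them freely accessible to the scientific community, are highly appreciated.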


 


Competing interests


 


The authors declare no conflicts of interest.


 


Copyright Mutte and Weijers. This article is distributed under the terms of the Creative Commons Attribution License (CC BY 4.0).
Citation: Readers should cite both the Bio-protocol article and the original research article where this protocol was used:
  1. Mutte, S. K. and Weijers, D. (2020). High-resolution and Deep Phylogenetic Reconstruction of Ancestral States from Large Transcriptomic Data Sets. Bio-protocol 10(6): e3566. DOI: 10.21769/BioProtoc.3566.
  2. Mutte, S. K., Kato, H., Rothfels, C., Melkonian, M., Wong, G. K. and Weijers, D. (2018). Origin and evolution of the nuclear auxin response system. Elife 7: e33399.