Brief Protocol for EDGE Bioinformatics: Analyzing Microbial and Metagenomic NGS Data

See the authors’ original research paper, which contains a brief version of this protocol:
Nucleic Acids Research
Jan 2017

Abstract

Next-generation sequencing (NGS) offers unparalleled resolution for untargeted organism detection and characterization. However, the majority of NGS analysis programs require users to be proficient in programming and command-line interfaces. EDGE bioinformatics was developed to offer scientists with little to no bioinformatics expertise a point-and-click platform for analyzing sequencing data in a rapid and reproducible manner. EDGE (Empowering the Development of Genomics Expertise) v1.0, released in January 2017, is an intuitive web-based bioinformatics platform engineered for the analysis of microbial and metagenomic NGS-based data (Li et al., 2017). The EDGE bioinformatics suite combines vetted, publicly available tools and tracks settings to ensure reliable and reproducible analysis workflows. To execute the EDGE workflow, only raw sequencing reads and a project ID are necessary. Users can access in-house data or run analyses on samples deposited in the Sequence Read Archive. Default settings offer a robust first glance and are often sufficient for novice users. All analyses are modular; users can easily turn workflows on or off and modify parameters to cater to project needs. Results are compiled and available for download in a PDF-formatted report containing publication-quality figures. We caution that interpreting results still requires in-depth scientific understanding; however, report visuals are often informative, even to novice users.

Keywords: Genomics, Bioinformatics, Next-generation sequencing, Metagenomics

Background

EDGE bioinformatics was developed to help biologists rapidly process next-generation sequencing (NGS) data even if they have little to no bioinformatics expertise. EDGE is a highly integrated and interactive web-based platform that is capable of running many of the standard analyses that biologists require for viral, bacterial/archaeal, and metagenomic samples. EDGE provides an intuitive web-based interface for user input, allows users to visualize and interact with selected results, and generates a final detailed PDF report. Results in the form of tables, text files, graphic files, and PDFs, together with the raw output files of executed programs, can all be downloaded. A user management system allows tracking of an individual’s EDGE runs, along with the ability to share, post publicly, delete, or archive their results. Users can explore ongoing data processing within a user-friendly, intuitive web-based environment, and interactive results are presented on a sample-by-sample basis. While EDGE was intentionally designed to be as simple as possible for the user, there is still no single ‘tool’ or algorithm that fits all use cases in the bioinformatics field. Our intent is to provide a detailed panoramic view of the user’s sample from various analytical standpoints. The initial release of EDGE in January 2017 provides six analytical workflows: pre-processing (data QC and host removal), assembly and annotation, reference-based analysis, taxonomy classification, phylogenetic analysis, and PCR analysis (validation and design). The latest release (version 1.5) includes several new features: identification of antimicrobial resistance and virulence genes, 16S/18S/fungal ITS analysis using QIIME, metadata collection/storage, and comparative analysis of taxonomic classification of multiple metagenomic samples. EDGE Bioinformatics is an ongoing effort to provide best-of-breed bioinformatics tools for NGS data analysis. Updates to current modules are continuous and more modules are under development.

Equipment

  1. Installing EDGE on a local server
    1. Hardware requirements: For high-throughput users who wish to process several large (50-500 million reads) samples at once, computers with 256 GB memory and 64 computing CPUs with 8 TB of local storage are highly recommended. The current computational hardware for one of the demonstration servers (https://bioedge.lanl.gov) is a Dell PowerEdge R720 with 4 x Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz = 24 cores (48 threads) and 512 GB RAM
    2. System requirements: EDGE bioinformatics has been tested on a Linux server with Ubuntu 14.04 or CentOS 7 operating systems, and requires 64 bit Linux environments. EDGE will not natively run on Mac OS X but, if enough computational resources are available for analyses, a Dockerized version of EDGE could be installed on a Mac in a Linux environment
    3. Essential libraries and dependencies required prior to EDGE software install are detailed in step-by-step command line instructions at https://edge.readthedocs.io/en/v1.5/system_requirement.html
    4. To simplify installation, a Docker image or a VM in OVF is also available. Information and links are located at http://edge.readthedocs.io/en/v1.5/installation.html#edge-docker-image and http://edge.readthedocs.io/en/v1.5/installation.html#edge-vmware-ovf-image
    5. Useful Resources: A web-based video tutorial series describing how to set up and run each EDGE module can be found at http://tutorial.getedge.org. Written documentation for software installation and for running EDGE is available at https://edge.readthedocs.io/en/v1.5/
  2. Demo versions of EDGE
    Los Alamos National Laboratory (LANL) and the Naval Medical Research Center (NMRC) host and/or support outward-facing demo versions of EDGE bioinformatics for prospective users. Any computer with internet access can use the demo web-based EDGE bioinformatics platforms. To run analyses on provided test data, or samples deposited in the Sequence Read Archive (SRA), visit https://bioedge.lanl.gov. To upload your own data (maximum file size is 5 GB) and run EDGE, visit http://hobo-nickel.getedge.org

Software

  1. EDGE source code is open-source and can be located at https://github.com/LANL-Bioinformatics/EDGE/tree/v1.5
  2. FaQCs software (Lo and Chain, 2014)
  3. PhaME (Ahmed et al., 2015) software

Procedure

Note: The procedure described herein assumes a user has obtained access to a demo version of EDGE bioinformatics, or has installed EDGE locally. Users must be logged in to an EDGE account to upload data, run analyses, or view past submissions. If you wish to install EDGE, step-by-step command line instructions are detailed at https://edge.readthedocs.io/en/v1.5/installation.html. Feel free to contact any member of the development team directly, or visit our Google group at edge-users@googlegroups.com. The procedure is outlined in the following sections: Accessing EDGE, Upload Data Files, Run EDGE Input Sample, Run EDGE Choose Processes, Run EDGE Job Submission, and Navigating Projects. A web-based video tutorial series describing how to set up and run each EDGE module can be found at http://tutorial.getedge.org.

  1. Accessing EDGE
    Note: To run EDGE on any platform (i.e., in-house or demo versions), users are required to create an account and log in. User permissions can be set to manage access levels to data and analyses.
    1. User Accounts & Log-in
      1. Open a web browser and access EDGE (see demo versions above, or access a local version). Web addresses (URLs) will depend on the internal network configuration for locally installed versions.
      2. At the EDGE interface, click on the silhouette in the top right-hand corner. New users should submit the information required for a new account. Returning users should sign in using their EDGE credentials.

  2. Upload data files
    Click on the ‘Upload Files’ tab. Drag and drop files for upload, or click the ‘+ Add Files’ button. Select ‘Start Upload’ to complete. Note that the maximum file size for upload is 5 GB. This is configurable on a local installation of EDGE. Allowed file types include FASTQ, FASTA, GenBank, and text (txt, config, ini), and files can be gzip-compressed. Uploaded files can be found in the MyUploads directory.
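The upload constraints above can be checked locally before files are sent to the server. The following is a minimal sketch, not part of EDGE: the function name `check_upload`, the list of file extensions, and the interpretation of 5 GB as 5 x 1024^3 bytes are all assumptions made for illustration.

```python
import os

# Assumed extensions for the file types named in the protocol
# (FASTQ, FASTA, GenBank, text); EDGE itself may accept other spellings.
ALLOWED_EXTENSIONS = {
    "fastq", "fq", "fasta", "fa", "fna", "gbk", "gb", "txt", "config", "ini",
}
# Assumed byte value for the 5 GB limit; configurable on local installations.
MAX_UPLOAD_BYTES = 5 * 1024**3


def check_upload(filename, size_bytes, max_bytes=MAX_UPLOAD_BYTES):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    name = os.path.basename(filename).lower()
    if name.endswith(".gz"):  # gzip-compressed files are allowed
        name = name[:-3]
    ext = name.rsplit(".", 1)[-1] if "." in name else ""
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {filename}")
    if size_bytes > max_bytes:
        problems.append(f"file exceeds upload limit: {size_bytes} bytes")
    return problems
```

For example, `check_upload("reads_R1.fastq.gz", 10)` returns an empty list, while a `.bam` file or a 6 GB FASTA file would each produce one reported problem.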

  3. Run EDGE: Input sample (see Figure 1)
    Note: EDGE parameter configurations are optimized for Illumina data. To run EDGE, sequence data files are needed in FASTQ format for a single sample.
    Click ‘Run EDGE’ to set up and submit jobs to the EDGE bioinformatics pipeline. The first section, ‘Input Raw Reads’, requires users to provide, at a bare minimum, a project name and sequencing data. EDGE accepts raw FASTQ files (single or paired-end), or Sequence Read Archive (SRA) accession numbers. Details for each setting are provided herein.


    Figure 1. Initiating a run in EDGE. Required input includes a unique project name and the location of sequencing reads to be processed/analyzed. Batch submission is possible and metadata collection is encouraged.

    1. Input Raw Reads
      1. Project Name
        Required field. Enter project name. Note that the same project name can be reused, hence unique identifiers are encouraged. There is a 3-character minimum and 30-character limit for the project name. Avoid using spaces, but dashes and underscores are acceptable.
      2. Description
        Optional field. Entry space to describe project/sample in more detail.
      3. Input from Sequencing Read Archive (SRA)
        Clickable Yes/No toggle that controls how the location of input sequencing reads is specified. The default setting is No; this requires users to use single- or paired-end FASTQ files from the EDGE Input Directory or Upload File directory. (See ‘Sequencing Reads’ below for instructions on how to proceed when set to No.) When this toggle is set to Yes, reads are obtained from SRA accession numbers. Internet access is required to use the Yes option. Supported SRA accession formats include: studies (SRP*/ERP*/DRP*), experiments (SRX*/ERX*/DRX*), samples (SRS*/ERS*/DRS*), runs (SRR*/ERR*/DRR*), or submissions (SRA*/ERA*/DRA*).
      4. Sequencing Reads
        Required field. Default settings require users to indicate the direct path for data (see ‘Input from Sequencing Read Archive (SRA)’ above for SRA input). EDGE accepts sequence data files in FASTQ format; compressed files (.gz) are also acceptable. Both paired-end and single-end sequences are permissible. Absolute file paths are required to run EDGE. Users can click the round button to the right of the file path text box to access data from the Upload File directory, or other directory structures configured internally (e.g., directly linked to an in-house Illumina sequencer).
      5. Additional Options
        In most cases, the additional options can be ignored. If a user wants to add more input read files, or increase the CPUs for a job, this field provides those options. Clicking on this field will expose the following parameters:
        1. Add paired-end input: add absolute path for additional paired-end input read files.
        2. Add single-end input: add absolute path for additional single-end input read files.
        3. Use # of CPUs: Specify the number of CPUs to be used; default and minimum value is ¼ of total number of CPUs on the server.
        4. Config file: a configuration file is generated automatically for every EDGE run. In the event that a job was interrupted and unfinished, submitting the config file will re-run the job and ensure that the submission runs exactly the same, with the same options.
    2. Batch Project Submission
      Batch submission allows a user to run multiple samples using the same configuration, rather than submitting jobs one-by-one. Batch submission is off by default. If this module is turned on, the ‘Input Sequence’ module will be turned off. To implement batch submission, an Excel file with project name, inputs, and project descriptions must be submitted; a sample is available for download.
      Batch Excel File: If turned on, user must provide absolute path to Batch Excel File.
    3. Sample Metadata
      EDGE supports the input and storage of metadata associated with the genomic or metagenomic sample being analyzed. This currently includes sample type (human, animal, or environmental), isolation source, sample collection date, sample collection location, sequencing platform and sequencing date.
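The input rules described in this section (project name length and characters, and the supported SRA accession prefixes) can be expressed as two small checks. This is an illustrative sketch only: the function names are hypothetical, restricting project names to letters, digits, dashes, and underscores is an assumption (the protocol only says to avoid spaces), and the accession pattern assumes a three-letter prefix followed by digits.

```python
import re

# 3-30 characters. Limiting the character set to letters, digits, dashes and
# underscores is an assumption; the protocol only advises against spaces.
PROJECT_NAME_RE = re.compile(r"[A-Za-z0-9_-]{3,30}")

# Accession prefixes listed in the protocol: first letter S/E/D, then R,
# then a letter giving the record type.
SRA_KINDS = {
    "P": "study",
    "X": "experiment",
    "S": "sample",
    "R": "run",
    "A": "submission",
}
SRA_RE = re.compile(r"[SED]R([PXSRA])\d+")


def is_valid_project_name(name):
    """True if the name is 3-30 characters long and contains no spaces."""
    return PROJECT_NAME_RE.fullmatch(name) is not None


def sra_accession_kind(accession):
    """Return the record type for a supported accession, or None otherwise."""
    match = SRA_RE.fullmatch(accession.strip().upper())
    return SRA_KINDS[match.group(1)] if match else None
```

With these definitions, the tutorial project name `EDGE_tutorial_FASTQ_3` passes the name check, whereas `my project` does not because it contains a space.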

  4. Run EDGE: Choose processes/analyses (see Figure 2)


    Figure 2. Selecting the EDGE modules for analysis. Users can include any module in a workflow by simply clicking the toggle ‘On’ for the module. Clicking on the arrow to the left of the module title shows subsections of the module and parameters which can be adjusted.

    EDGE v1.5 has seven modules: Pre-processing, Assembly and Annotation, Reference-based Analysis, Taxonomy Classification, Phylogenetic Analysis, Gene Family Analysis, and PCR Primer Tools. After input files have been selected and a project name has been assigned, click ‘Submit’ to run the EDGE suite using default parameters. The following analyses are automatically turned on: Pre-processing Quality Trim and Filter, Assembly and Annotation, Taxonomy Classification. Users have full control over which modules to run, and can modify parameter values according to project needs. Each module, and its key parameters are outlined below.
    Note: Modules can be expanded or collapsed by clicking the module header. To turn modules on or off, use the toggle button within each header. Expand each module to see/edit settings. Module parameters and default settings are listed in Table 1. Modules/processes which are set ON by default are shown in Table 1 in green.

    Table 1. EDGE modules and default settings


    1. Pre-processing (Default is ON)
      Note: Pre-processing contains two components: Quality Trim and Filter, and Host Removal. By default, Quality Trim and Filter is turned ON, and Host Removal is turned OFF. This module is not required for downstream analyses, but highly recommended when processing raw reads.
      1. Quality Trim and Filter
        FaQCs software (Lo and Chain, 2014) is used to rapidly analyze reads for quality, then trim or filter those with poor quality. Pre-set parameter values are appropriate to filter unwanted reads in most cases. Exceptions can arise when specialized adapter sequences have been added. In this case, a user can supply FASTA files or specify the number of base pairs to trim from each end of the reads. After this step, only high-quality reads are passed to downstream analyses. Each parameter is described below; default settings can be found in Table 1.
        1. Run Quality Trim and Filter: Yes/No command to execute pipeline
        2. Trim Quality Level: minimum quality threshold based on Phred scores
        3. Average Quality Cutoff: filter based on average quality score of entire read
        4. Minimum Read Length: filter reads based on length
        5. ‘N’ Base Cutoff: discard reads with more than this number of continuous ‘N’ bases
        6. Low Complexity Filter Ratio: indicate maximum fraction of mono-/di-nucleotide sequence permissible
        7. Adapter FASTA: adapters can be removed from sequences; user must provide FASTA file containing adapter sequences
        8. Cut #bp from 5’-end: define a set number of base pairs to remove from the 5’ end of each read
        9. Cut #bp from 3’-end: define a set number of base pairs to remove from the 3’ end of each read
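To make the roles of the trimming and filtering parameters concrete, the sketch below applies a quality trim, a minimum length filter, an average quality cutoff, and a continuous ‘N’ cutoff to a single read. It is a simplified illustration and not the FaQCs algorithm: the end-trimming strategy, the Phred+33 encoding, and the function names are assumptions, and the threshold values are passed in by the caller rather than taken from Table 1.

```python
import re


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_and_filter(seq, qual, trim_q, avg_q, min_len, max_n_run):
    """Return the trimmed (seq, qual) pair, or None if the read is discarded."""
    scores = phred_scores(qual)
    start, end = 0, len(seq)
    # Simplified trimming: drop bases below trim_q from both ends.
    while start < end and scores[start] < trim_q:
        start += 1
    while end > start and scores[end - 1] < trim_q:
        end -= 1
    seq, qual, scores = seq[start:end], qual[start:end], scores[start:end]
    # Minimum Read Length filter (empty reads are always discarded).
    if not seq or len(seq) < min_len:
        return None
    # Average Quality Cutoff over the whole trimmed read.
    if sum(scores) / len(scores) < avg_q:
        return None
    # 'N' Base Cutoff: longest run of continuous N bases.
    longest_n = max((len(run) for run in re.findall(r"N+", seq.upper())), default=0)
    if longest_n > max_n_run:
        return None
    return seq, qual
```

For instance, a read `ACGTACGT` with qualities `IIIIII##` loses its last two low-quality bases when `trim_q` is 5, and is then kept or discarded depending on the remaining thresholds.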
      2. Host Removal
        While called ‘Host Removal’, this module is used to subtract unwanted reads that align to any selected reference. For example, unwanted reads derived from hosts and/or from positive controls, such as PhiX, can be filtered out at this step. Whether to employ this step and, if so, which genomes to select for host removal depend on each sample’s origin and are therefore left to the user’s discretion. Reads are mapped to the reference genome(s) using BWA, and removed based on the similarity threshold parameter. At the EDGE interface, this module must be turned on to run. Parameter descriptions and uses:
        1. Run Host Removal: Yes/No command to execute pipeline
        2. Select Genome(s): Click on dropdown menu to select common hosts or search for additional hosts from RefSeq by typing in the search text box. Choose relevant host genomes by clicking; blue checkmarks will indicate selected hosts. The number of hosts selected is unlimited. Click on the X in the top left corner of the selection menu to save results.
          Note: This RefSeq database refers to all complete bacterial and archaeal genomes plus complete viral genomes and near neighbors.
        3. Host FASTA File: a user also has the option to upload a specific host sequence (i.e., sequenced in-house, or host is not present in the EDGE database) for removal. To do so, provide the direct path to the FASTA formatted file containing the host sequence.
        4. Similarity (%): Minimum percent similarity threshold used in host removal, where similarity is calculated as [aligned bases in read]/[read length] x 100. 90% similarity is the default and lowest recommended setting.
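The similarity calculation used for host removal can be written out directly. The sketch below is illustrative only; the function names are hypothetical, and treating a read as host-derived when its similarity is at or above the threshold is our reading of the parameter description rather than a statement of how EDGE implements it.

```python
def percent_similarity(aligned_bases, read_length):
    """[aligned bases in read] / [read length] x 100."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return aligned_bases * 100 / read_length


def is_host_read(aligned_bases, read_length, threshold=90.0):
    """Assumed rule: reads at or above the threshold are removed as host."""
    return percent_similarity(aligned_bases, read_length) >= threshold
```

Under this rule, a 150 bp read with 140 aligned bases (about 93% similarity) would be removed at the default 90% threshold, while a read with 130 aligned bases (about 87%) would be retained.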
    2. Assembly and Annotation (Default is ON)
      Note: Assembly and Annotation pipelines are turned on by default in EDGE. In order to annotate a genome or perform any downstream contig-based analysis, assembly must be completed. There is an option to upload pre-assembled contigs in the form of a FASTA file, then bypass the assembly module. If assembly fails, downstream modules requiring contig files will be bypassed.
      1. Assembly
        Three different de novo assembler options are provided: IDBA_UD (Peng et al., 2012), SPAdes (Bankevich et al., 2012), and MEGAHIT (Li et al., 2015). Optimal selection of an assembler can depend on sample type (e.g., isolate vs. metagenome), data size, and time available for analysis. Pre-set parameters for each assembler are robust and perform well in the majority of cases. Users can set the minimum cut-off value for final contigs. As default, contigs smaller than 200 bp are filtered out. If using sequence reads longer than 200 bp (for instance 2 x 300 bp), this threshold should be adjusted to the read length. Read-alignment validation is used to ensure confidence in assembly.
        1. Bypass Assembly and use Pre-assembled Contigs: Yes/No command to execute
        2. Assembler: Three de novo assemblers are built into EDGE. IDBA_UD (default) performs efficiently on both isolate and metagenomic samples; however, it is not ideal for large genomes. SPAdes offers multiple preset configurations (tailored for single cells, metagenomes, plasmids, or RNA-Seq) and performs well on isolates and metagenomes, but can be very computationally intensive for any large dataset. SPAdes can additionally take in long read data (PacBio or Nanopore) together with the Illumina short read data to produce a hybrid assembly; this option increases the computational resources required. MEGAHIT is a fast, robust solution for large and complex metagenomic samples.
        3. Validation Aligner: Bowtie 2 (default) or BWA-MEM can be selected to map reads back to assembled contigs for validation.
      2. Annotation
        Successful assembly is a prerequisite for annotation. EDGE offers users two annotation tools: PROKKA (Seemann, 2014) and RATT (Otto et al., 2011). PROKKA is appropriate for most cases; it has been designed for rapid annotation of prokaryotic genomes. Alternatively, users can use RATT to transfer annotation from an annotated reference genome to an unannotated sample.
        1. Annotation: Yes/No command to execute pipeline
        2. Minimum Contig Length for Annotation: User defined length of contig to include in annotation
        3. Annotation Tool: Two annotation tools are provided. If PROKKA is selected, the user must also choose the genome type to annotate under Specify the Kingdom (Archaea, Bacteria, Mitochondria, Viruses, Others). RATT, on the other hand, will transfer the annotation from a reference genome to the sample of interest. The reference genome must be a close relative to the sample. If RATT is selected, a user must provide the GenBank formatted reference/source annotation file.
    3. Reference-based Analysis
      Reference-based analysis is a useful tool for investigating samples of known composition, for instance, studying a pure bacterial culture. This module maps reads and contigs to references selected by the user to obtain coverage information and to identify uncovered regions where sample reads or assembled contigs do not align to the reference. The output of this module provides information on variants, such as single nucleotide polymorphisms (SNPs), and uncovered regions (potential insertions/deletions) that do not align to the reference. Variants are identified using SAMtools (Li et al., 2009). The in-depth results can be interactively explored using the genome browser, JBrowse (Skinner et al., 2009).
      Note: Reference-based analysis is off by default. Users can turn this module on using the toggle button.
      1. Select Genome(s): Pre-built reference list. Click on the dropdown menu and search for microbial species of interest. It is important to choose closely related organisms. Click on desired references and blue checkmarks will indicate selected genomes. The number of references selected is unlimited. Click on the X in the top left corner of the selection menu to save results.
      2. Reference Genome: If a reference organism is not in the pre-constructed list, users can upload an appropriate FASTA or GenBank file for their experiment.
      3. Reads Aligner: Users can choose Bowtie 2 or BWA-MEM as the read mapper. The two algorithms will yield very similar results and can be set based on user preference.
    4. Taxonomy classification (Default is ON)
      The EDGE Taxonomy module will perform sequence classification and determine sample composition. This module is useful for identifying organisms in complex samples. Similarly, taxonomy classification is useful for analyzing purified cultures to detect contamination coming from lab reagents or mishandling of samples.
      Note: By default, taxonomic classification is turned on for both reads and contigs. EDGE implements several different databases and algorithms for taxonomy assignments. By default, all of the tools are turned on to take advantage of their strengths and provide users with cross-validated assessments. The tools vary in sensitivity and classification.
      1. Read-based taxonomy classification
        1. Always Use All Reads: Yes/No command to indicate what reads will be used in taxonomy. Yes (default) indicates that all reads that pass pre-processing will be used. If the user has provided a reference for Reference-based analysis, and selects No, then results will include only reads that are different from the reference.
        2. Classification Tools: Drop down menu with checkbox selection for different profiling tools. Default settings implement all databases, which is recommended in order to take advantage of all tools which vary significantly in terms of sensitivity and specificity. GOTTCHA will provide the most specific results (a very low false positive rate), while BWA and Kraken will provide the most sensitive results (a very low false negative rate).
          1) GOTTCHA Bacterial Databases (Genus, Species, Strain) (specific) (Freitas et al., 2015), version 20150825.
          2) GOTTCHA Viral Databases (Genus, Species, Strain) (specific) (Freitas et al., 2015), version 20150825.
          3) Read Mapping using BWA against RefSeq (see Note above for description) (sensitive) (Chen et al., 2010)
          4) MetaPhlAn, searches clade-specific marker genes (specific) (Segata et al., 2012)
          5) Kraken mini, exact alignment of k-mers (sensitive) (Wood and Salzberg, 2014)
      2. Contig-based Taxonomy Classification
        Contigs classification: Yes/No command to execute. Yes (default) indicates contigs will be mapped against NCBI databases for taxonomy and functional annotations.
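Because the classification tools differ in sensitivity and specificity, it can help to see which taxa are supported by more than one tool. The sketch below tallies tool support per taxon from a simple mapping of tool name to reported taxa. It is not how EDGE summarizes results; the function name and input layout are hypothetical, and any real input would first have to be parsed from each tool's output.

```python
from collections import defaultdict


def tool_support(calls_by_tool):
    """Map each taxon to the sorted list of tools that reported it."""
    support = defaultdict(list)
    for tool, taxa in calls_by_tool.items():
        for taxon in set(taxa):  # ignore duplicate calls within one tool
            support[taxon].append(tool)
    return {taxon: sorted(tools) for taxon, tools in support.items()}


def supported_by_at_least(calls_by_tool, min_tools):
    """Return the sorted taxa reported by at least min_tools different tools."""
    support = tool_support(calls_by_tool)
    return sorted(t for t, tools in support.items() if len(tools) >= min_tools)
```

A taxon reported only by the more sensitive tools, and by none of the more specific ones, may warrant closer inspection before it is accepted as present in the sample.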
    5. Phylogenetic Analysis
      EDGE will construct phylogenetic trees for reads and contigs. PhaME (Ahmed et al., 2015) software is implemented to align core conserved sections of genomes, perform whole-genome SNP discovery, and build a phylogenetic tree. Users can choose from pre-computed pathogen databases or build their own by selecting genomes provided in EDGE databases, or uploading their own.
      Note: Phylogenetic analysis will not automatically run. Users must turn the toggle switch on and address required parameters for this module. Due to the nature of the tool, users should choose closely related strains or species only; this will ensure that the user’s target genome falls within the final tree build. Additionally, although this module has been successfully applied to metagenomes, the phylogeny tool was engineered with isolate genome projects in mind. If a user selects genomes with little to no similarity, the tool will exclude them from the analysis.
      1. Tree Build Method: Users have two options for generating a phylogenetic tree. FastTree is fast and is selected by default (Price et al., 2010). RAxML is more accurate, but more time consuming (Stamatakis, 2014).
      2. Pre-built SNP DB: EDGE supports 5 pre-computed pathogen databases for SNP phylogeny analysis (Escherichia coli, Yersinia, Francisella, Brucella, Bacillus).
      3. Select Genome(s): RefSeq genomes (see Note above for description) are available in this dropdown menu. Open menu, search, and click on desired genomes to select.
      4. Add Genome(s): User can provide FASTA entries for genomes to be built into tree. A maximum of 20 reference genomes can be used and including an outgroup is recommended.
      5. SRA Accessions: SRA entries are allowed for specifying references to be used in phylogenetic analysis.
      6. Bootstrap: Yes/No command to execute bootstrap in analysis.
      7. Bootstrap Number: Can be modified if user indicates Yes for Bootstrap method.
    6. Gene Family Analysis
      Note: The Gene Family Analysis module searches reads and annotated coding sequences (CDS) for specific gene families (currently, antibiotic resistance and virulence gene families). There are two components to this module: read-based and contig-based profiling. To perform either analysis, users must first turn this module on using the toggle switch, then ensure each sub-analysis is set to Yes. Contig-based analysis requires successful assembly.
      1. Read-based Gene Family Analysis
        The read-based analysis uses the ShortBRED (Kaminski et al., 2015) algorithm to search for antimicrobial resistance genes in the Antibiotic Resistance Genes Database, ARDB (Liu and Pop, 2009), and Resfams (Gibson et al., 2014) databases. Similarly, ShortBRED will search for virulence genes using a version of the Virulence Factor Database, VFDB (Chen et al., 2005), curated by an EDGE developer.
        Read-based Gene Family Analysis: Yes/No command to execute analysis.
      2. Contig-based (CDS) Gene Family Analysis
        Similarly, virulence genes are called using ShortBRED and VFDB. However, for antibiotic resistance gene finding, the Comprehensive Antibiotic Resistance Database (CARD) program RGI (Jia et al., 2017) is used for CDS-based analysis performed on contigs.
        Contig-based (CDS) Gene Family Analysis: Yes/No command to execute analysis.
    7. PCR Primer Analysis
      Note: The PCR Primer Analysis module consists of two processes: validation of existing primers and design of new primers. PCR Primer Analysis will not automatically run. Users must turn the toggle switch on, and turn on each independent component of this module.
      1. Primer Validation
        In the validation pipeline, users upload a file containing existing primer sequences. EDGE maps these primers to the sample’s assembly using BWA to determine if the amplicon is generated. Users can define the number of mismatches allowed in this validation.
        1. Run Primer Validation: Yes/No command to execute analysis.
        2. Primer FASTA Sequences: Provide absolute path to input file with existing primers to be validated. Must contain an even number of forward and reverse primers saved in FASTA format.
        3. Maximum Mismatch: Indicate the maximum number of mismatches allowed per primer sequence. Click on the number to select (i.e., 0, 1, 2, 3, or 4).
      2. Primer Design
        To design primers for newly assembled contigs, EDGE identifies unique regions of the sample using BWA then generates primer sets using Primer3 (Untergasser et al., 2012). Primers are designed, and compared to RefSeq to ensure the selected regions in the sample genome are indeed unique. Users indicate desired primer reaction parameters for melting temperature, amplicon sizing, and the number of primer pairs.
        1. Run Primer Design: Yes/No command to execute analysis.
        2. Primer Settings: All reaction parameters can be modified. Parameters for primer design include: Tm Optimum (°C), Tm Range (°C), Length Optimum (bp), Length Range (bp), Background Tm Differential (°C).
        3. The number of Primer Pairs: Desired number of new primer pairs to be designed.
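The idea behind primer validation with a maximum mismatch setting can be illustrated with a naive scan of a template sequence. EDGE performs this step by mapping primers with BWA; the sketch below substitutes a simple sliding-window comparison for clarity, handles only the forward strand of the template and unambiguous A/C/G/T bases, and uses hypothetical function names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(primer, template, max_mismatch):
    """0-based template positions where the primer matches within max_mismatch."""
    primer, template = primer.upper(), template.upper()
    k = len(primer)
    sites = []
    for i in range(len(template) - k + 1):
        mismatches = sum(1 for a, b in zip(primer, template[i:i + k]) if a != b)
        if mismatches <= max_mismatch:
            sites.append(i)
    return sites


def amplicon_length(forward, reverse, template, max_mismatch=0):
    """Shortest predicted amplicon length, or None if the pair does not amplify."""
    fwd_sites = find_sites(forward, template, max_mismatch)
    # The reverse primer anneals to the opposite strand, so search the
    # template for its reverse complement downstream of the forward primer.
    rev_sites = find_sites(reverse_complement(reverse), template, max_mismatch)
    lengths = [
        r + len(reverse) - f
        for f in fwd_sites
        for r in rev_sites
        if r >= f + len(forward)
    ]
    return min(lengths) if lengths else None
```

Raising `max_mismatch` allows a primer with a small number of mismatched bases to be counted as binding, which mirrors the effect of the Maximum Mismatch setting described above.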

  5. Run EDGE: Job submission
    After input files, project name, and desired process options have been defined (Procedure sections 1 through 4), a job is ready for submission. To submit a job, click on the ‘Submit’ button at the bottom of the page. Red and/or green indicators will immediately appear to show whether the job was submitted successfully. If errors occur, the message (in red) is clickable and will return users to the section needing attention, where the data entry box(es) with errors are highlighted with a yellow glow.

  6. Navigate Projects
    Upon successful submission, users can monitor job status by clicking ‘Projects’ in the navigation bar on the left-hand side of the user interface. This provides a list of projects submitted by the user. Job status of each project in the list is indicated by a color-coded system: grey = not yet begun, red = error, orange = in progress (running), green = completed. Errors normally only occur if there are issues in retrieving or reading the input data files. If a job is in progress, clicking on the project in the left navigation bar will open the Project page where results are posted in real time.
    The Project page has links to other pages and to the project list on the left (which can be hidden with a link in the upper left corner). Project information and results are shown in the center section. The information on this page is static and allows users to access portions of the run that are already complete; however, the page needs to be refreshed to show any updates to the project. A small square icon in the upper right corner of the screen opens a sidebar on the right that displays active monitoring of job progress and server usage. Options to interrupt, delete, rerun, or share the project and to view the live log are available in this sidebar. Figure 3 shows a screenshot of a project page from https://bioedge.lanl.gov/, originally published by Oxford University Press in Nucleic Acids Research (Li et al., 2017).


    Figure 3. The EDGE Project page displays an analysis in progress

Data analysis

In addition to the several examples provided in our original publication (Li et al., 2017) which can be viewed at https://bioedge.lanl.gov, two additional use cases were run to demonstrate much of the functionality within EDGE; the full analyses for these runs can be viewed at http://hobo-nickel.getedge.org. The first use case is a reduced dataset for E. coli (10x coverage) which will run quickly to test the modules and is entitled EDGE_tutorial_FASTQ_3. The second use case is a metagenomic sample from an E. coli outbreak in 2011 with the data downloaded from the Sequence Read Archive (SRA) and is entitled EDGE_tutorial_SRA_2. Table 2 details precise EDGE parameter settings for these two runs.

Table 2. Use Case Parameter Settings


The Project page includes the statistics of the run (each module and time to completion) and links to the output directory with all the results, summary log files, and a PDF summary of results (see Figure 3). Each module produces summarized results with both text and figures along with some interactive graphics within the context of the Project page. Figure 4 shows some examples of graphical output from the two use cases described in Table 2.


Figure 4. Examples of graphical output in EDGE for Use Cases 1 and 2. EDGE provides many results as visual graphics within the Project page. Use Case 1 is an isolate E. coli dataset with approximately 10x coverage. This reduced-coverage dataset was created to test EDGE installation. A. Shows the interactive view of reads and contigs mapped to a reference in the genome browser. B. Shows a closer inspection of a region with an SNP/variant highlighted in the same reference-based analysis. *C. Shows both reads and assembled contigs placed into a phylogenetic tree based on whole genome SNP analysis. D. Shows results for a larger dataset (50x) for the same genome; with the higher coverage dataset, reads and contigs are placed directly adjacent to one another and to E. coli K12 MG1655. Use Case 2 is a clinical fecal sample from an E. coli outbreak in 2011. E. Shows results from multiple tools for taxonomy classification in a heatmap.
*Note: At this low depth of coverage the reads and contigs are not placed immediately adjacent to one another in the tree, but the contigs are placed adjacent to the correct reference genome, E. coli K12 MG1655.
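
The 10x and 50x figures quoted for Use Case 1 are mean depths of coverage, which can be estimated as the total number of sequenced bases divided by the genome length. The following is a minimal sketch of that arithmetic and is not part of EDGE; the read length (150 bp) is a hypothetical value chosen for illustration, and the genome length is an approximate figure of about 4.64 Mbp for E. coli K12 MG1655.

```python
import math


def depth_of_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Mean depth of coverage = total sequenced bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


def reads_needed(target_depth: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads that reaches the target mean depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_depth * genome_length / read_length)


if __name__ == "__main__":
    # Assumptions for illustration only (not taken from the EDGE use cases):
    genome_length = 4_640_000  # approximate E. coli K12 MG1655 genome size, in bp
    read_length = 150          # hypothetical read length, in bp

    for target in (10, 50):
        n = reads_needed(target, read_length, genome_length)
        depth = depth_of_coverage(n, read_length, genome_length)
        print(f"{target}x target: {n} reads give a mean depth of {depth:.2f}x")
```

Because this is a mean over the whole genome, individual regions can still fall well below the target depth, which is consistent with the less precise tree placement noted for the 10x dataset.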

Within its integrated visualization, each module also provides links to the primary output of its analyses (e.g., assembled contigs, reads mapped to a reference, SNPs/variants, abundance tables). Figure 5 shows an example of tabular output with links that let the project owner download output files, along with links to external sources of information (e.g., NCBI, ARDB). This is a screenshot of a portion of a project page from https://bioedge.lanl.gov/.


Figure 5. Results of reference-based analysis showing links with the tabular output of data mapped to an E. coli genome (one chromosome and three plasmids). This clinical sample came from an E. coli outbreak in 2011 and is the same sample as in Use Case 2. The data were mapped back to a reference genome from the same outbreak. The black arrow and bracket on the left highlight links to NCBI for more information about each of the replicons. The yellow arrows indicate links to download output files of reads mapped to each of the replicons. The blue arrow is a link to an interactive view of the reads and contigs mapped to the reference E. coli replicons. The green arrow and bracket in the lower right highlight links to graphics, tables, and the full output of the reference-based analysis module. Similar links are available for all the modules.

Notes

The tools included in EDGE have been selected for robustness, speed, and accuracy. Because k-mer-based assemblers have inherently non-deterministic properties, assembly results may vary somewhat from run to run, but all other results are fully reproducible. While EDGE gives even novice NGS users the ability to perform complex analyses easily, we encourage users to understand the tools and algorithms and to have some insight into how results should be interpreted.
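
One way to gauge how much two assembly runs of the same sample differ is to compare simple contig statistics such as contig count, total assembled length, and N50. The sketch below is an illustration only and is not how EDGE itself reports assemblies; the function names are hypothetical, and the FASTA text is a toy input rather than real assembler output.

```python
def contig_lengths(fasta_text: str) -> list[int]:
    """Return the length of every sequence in FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs


def summarize(fasta_text: str) -> dict[str, int]:
    """Collect the statistics used to compare two assembly runs."""
    lengths = contig_lengths(fasta_text)
    return {"contigs": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}


if __name__ == "__main__":
    # Toy contigs standing in for the output of two runs on the same reads.
    run_a = ">c1\nACGTACGT\nACGT\n>c2\nACGTAC\n>c3\nAC\n"
    run_b = ">c1\nACGTACGTACGTAC\n>c2\nACGTAC\n"
    for name, text in (("run A", run_a), ("run B", run_b)):
        print(name, summarize(text))
```

Small differences in these numbers between runs are expected for the reason given above; large differences would suggest checking the input reads or the assembler settings.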

Acknowledgments

This work is funded by the Defense Threat Reduction Agency. The views expressed in this manuscript are those of the authors and do not necessarily reflect the official policy or position of the Department of the Navy, the Department of Defense, the National Institutes of Health, the Department of Health and Human Services, nor the U.S. Government. Title 17 U.S.C. §105 provides that ‘Copyright protection under this title is not available for any work of the United States Government.’ Title 17 U.S.C. §101 defines a U.S. Government work as a work prepared by a military service member or employee of the U.S. Government as part of that person’s official duties.

References

  1. Ahmed, S. A., Lo, C. C., Li, P. E., Davenport, K. W. and Chain, P. S. G. (2015). From raw reads to trees: Whole genome SNP phylogenetics across the tree of life. bioRxiv.
  2. Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko, S. I., Pham, S., Prjibelski, A. D., Pyshkin, A. V., Sirotkin, A. V., Vyahhi, N., Tesler, G., Alekseyev, M. A. and Pevzner, P. A. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19(5): 455-477.
  3. Chen, L., Yang, J., Yu, J., Yao, Z., Sun, L., Shen, Y. and Jin, Q. (2005). VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res 33: D325-D328.
  4. Chen, P. E., Cook, C., Stewart, A. C., Nagarajan, N., Sommer, D. D., Pop, M., Thomason, B., Thomason, M. P., Lentz, S., Nolan, N., Sozhamannan, S., Sulakvelidze, A., Mateczun, A., Du, L., Zwick, M. E. and Read, T. D. (2010). Genomic characterization of the Yersinia genus. Genome Biol 11(1): R1.
  5. Freitas, T. A., Li, P. E., Scholz, M. B. and Chain, P. S. (2015). Accurate read-based metagenome characterization using a hierarchical suite of unique signatures. Nucleic Acids Res 43(10): e69.
  6. Gibson, M. K., Forsberg, K. J. and Dantas, G. (2015). Improved annotation of antibiotic resistance functions reveals microbial resistomes cluster by ecology. ISME J 9(1): 207-216.
  7. Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo, P., Tsang, K. K., Lago, B. A., Dave, B. M., Pereira, S., Sharma, A. N., Doshi, S., Courtot, M., Lo, R., Williams, L. E., Frye, J. G., Elsayegh, T., Sardar, D., Westman, E. L., Pawlowski, A. C., Johnson, T. A., Brinkman, F. S., Wright, G. D. and McArthur, A. G. (2017). CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res 45(D1): D566-D573.
  8. Kaminski, J., Gibson, M. K., Franzosa, E. A., Segata, N., Dantas, G. and Huttenhower, C. (2015). High-specificity targeted functional profiling in microbial communities with ShortBRED. PLoS Comput Biol 11(12): e1004557.
  9. Li, D., Liu, C. M., Luo, R., Sadakane, K. and Lam, T. W. (2015). MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31(10): 1674-1676.
  10. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079.
  11. Li, P. E., Lo, C. C., Anderson, J. J., Davenport, K. W., Bishop-Lilly, K. A., Xu, Y., Ahmed, S., Feng, S., Mokashi, V. P. and Chain, P. S. (2017). Enabling the democratization of the genomics revolution with a fully integrated web-based bioinformatics platform. Nucleic Acids Res 45(1): 67-80.
  12. Lo, C. C. and Chain, P. S. (2014). Rapid evaluation and quality control of next generation sequencing data with FaQCs. BMC Bioinformatics 15: 366.
  13. Liu, B. and Pop, M. (2009). ARDB--Antibiotic Resistance Genes Database. Nucleic Acids Res 37(Database issue): D443-D447.
  14. Otto, T. D., Dillon, G. P., Degrave, W. S. and Berriman, M. (2011). RATT: Rapid annotation transfer tool. Nucleic Acids Res 39(9): e57.
  15. Peng, Y., Leung, H. C., Yiu, S. M. and Chin, F. Y. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11): 1420-1428.
  16. Price, M. N., Dehal, P. S. and Arkin, A. P. (2010). FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One 5(3): e9490.
  17. Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics 30(14): 2068-2069.
  18. Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O. and Huttenhower, C. (2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods 9(8): 811-814.
  19. Skinner, M. E., Uzilov, A. V., Stein, L. D., Mungall, C. J. and Holmes, I. H. (2009). JBrowse: a next-generation genome browser. Genome Res 19: 1630-1638.
  20. Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9): 1312-1313.
  21. Untergasser, A., Cutcutache, I., Koressaar, T., Ye, J., Faircloth, B. C., Remm, M. and Rozen, S. G. (2012). Primer3--new capabilities and interfaces. Nucleic Acids Res 40(15): e115.
  22. Wood, D. E. and Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15(3): R46.

Copyright: © 2017 The Authors; exclusive licensee Bio-protocol LLC.
Citation: Philipson, C., Davenport, K., Voegtly, L., Lo, C., Li, P., Xu, Y., Shakya, M., Cer, R. Z., Bishop-Lilly, K. A., Hamilton, T. and Chain, P. S. (2017). Brief Protocol for EDGE Bioinformatics: Analyzing Microbial and Metagenomic NGS Data. Bio-protocol 7(23): e2622. DOI: 10.21769/BioProtoc.2622.