搜索

Bioinformatic Analysis for Profiling Drug-induced Chromatin Modification Landscapes in Mouse Brain Using ChlP-seq Data
利用ChIP-seq数据分析药物诱导小鼠大脑中染色质修饰发生改变的生物信息学方法   

评审
匿名评审
引用 收藏 提问与回复 分享您的反馈 Cited by

本文章节

参见作者原研究论文

本实验方案简略版
Genome Biology
Dec 2012

Abstract

Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) is a powerful technology to profile genome-wide chromatin modification patterns and is increasingly being used to study the molecular mechanisms of brain diseases such as drug addiction. This protocol discusses the typical procedures involved in ChIP-seq data generation, bioinformatic analysis, and interpretation of results, using a chronic cocaine treatment study as a template. We describe an experimental design that induces significant chromatin modifications in mouse brain, and the use of ChIP-seq to derive novel information about the chromatin regulatory mechanisms involved. We describe the bioinformatic methods used to preprocess the sequencing data, generate global enrichment profiles for specific histone modifications, identify enriched genomic loci, find differential modification sites, and perform functional analyses. These ChIP-seq analyses provide many details into the chromatin changes that are induced in brain by chronic exposure to cocaine, and generates an invaluable source of information to understand the molecular mechanisms underlying drug addiction. Our protocol provides a standardized procedure for data analysis and can serve as a starting point for any other ChIP-seq projects.

Keywords: Chromatin immunoprecipitation (ChIP) (染色质免疫沉淀(ChIP)), Next generation sequencing (NGS) (下一代测序(NGS)), ChIP-seq (ChIP-seq), Cocaine (可卡因), Bioinformatics (生物信息学), Epigenetics (表观遗传学), Histone modifications (组蛋白修饰)

Background

Chromatin modification has been implicated in the molecular mechanisms of drug addiction and may hold the key to understanding multiple aspects of addictive behaviors (Robison and Nestler, 2011). Chromatin Immuno-Precipitation (ChIP) followed by massively parallel sequencing (ChIP-seq) is the current state of the art technology to profile the chromatin landscape. The typical procedure of ChIP-seq involves: 1) using the antibody against a protein of interest to pull down the binding DNA, which has been fixed to the protein and broken into smaller fragments, 2) the immunoprecipitated DNA is then purified and constructed into a library for high throughput sequencing of short reads (usually 50-100 bp) from the ends of insert DNA fragments, 3) the short reads are aligned to the genome and put through data analysis. Compared with its predecessor – ChIP-chip, ChIP-seq has unparalleled advantages such as unbiased coverage of the entire genome, single base resolution, and significantly improved signal-to-noise ratio (Park, 2009). It has proven to be an invaluable tool to understand numerous types of chromatin modifications.

Brief overview of ChIP-seq experiment: Please see original research articles for greater experimental details (Renthal et al., 2009; Lee et al., 2006). Briefly, adult mice received a standard regimen of repeated cocaine (7 daily IP injections of cocaine [20 mg/kg] or saline) and were used 24 h after the last injection (Robison and Nestler, 2011). The nucleus accumbens, a major brain reward region, was obtained by punch dissection and used for ChIP-seq. Chromatin IP was performed as described previously (Renthal et al., 2009; Lee et al. 2006; http://jura.wi.mit.edu/young_public/hES_PRC/ChIP.html) with minor modifications, using two antibodies, anti-H3K4me3 (tri-methylation of Lys4 in histone H4) (Abcam #ab8580) and anti-H3K9me3 (Abcam #ab8898). Sequencing libraries for each experimental condition were generated in triplicate, and were then sequenced on an Illumina sequencer.

Bioinformatic analysis: The sequencing data generated from libraries under treatment and corresponding control conditions will be used to identify the specific genomic regions that have undergone chromatin changes. We can then associate these regions with biological functions and select the regions of interest for further study. The basic procedure for this kind of bioinformatic analysis can be laid out as a pipeline of eight steps (Figure 1):
1. Perform quality control analyses on the sequencing data files;
2. Align the sequences to the genome;
3. Remove PCR duplicates;
4. Perform cross-correlation quality analysis;
5. Generate coverage files to be loaded into a genome browser;
6. Generate global coverage plots;
7. Perform peak detection and annotation;
8. Perform differential analysis and functional association.

This protocol shall explain each of these steps in more detail, including some of the principles and rationale underlying the bioinformatics analyses, and provide the necessary commands for execution in a Unix-like command-line environment.


Figure 1. Flowchart of the ChIP-seq analysis pipeline

Equipment

  1. Personal computer or high performance computing cluster. All software mentioned here can be run under a Unix-like workstation, such as Linux and Mac. For Windows users, a terminal emulator, such as Cygwin (http://cygwin.com/), can be used.

Software

  1. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
  2. Bowtie2 (Langmead and Salzberg, 2012; http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
  3. Samtools (Li et al., 2009; http://samtools.sourceforge.net/)
  4. phantompeakqualtools (http://code.google.com/archive/p/phantompeakqualtools/)
  5. IGV and IGVtools (Robinson et al., 2011; http://software.broadinstitute.org/software/igv/)
  6. ngs.plot (Shen et al., 2014; http://github.com/shenlab-sinai/ngsplot)
  7. MACS (Zhang et al., 2008; http://github.com/taoliu/MACS)
  8. diffReps (Shen et al., 2013; http://github.com/shenlab-sinai/diffreps)
  9. bedtools (Quinlan and Hall, 2010; http://bedtools.readthedocs.io/en/latest/)
  10. region-analysis (Shao et al., 2016; http://github.com/shenlab-sinai/region_analysis)
  11. ChIPseqRUs (Loh et al., 2016; https://github.com/shenlab-sinai/chip-seq_preprocess)
  12. David (Dennis et al., 2003; http://david.ncifcrf.gov/)
  13. IPA (http://www.ingenuity.com/)
  14. GREAT (McLean et al., 2010; http://bejerano.stanford.edu/great/public/html/)

Procedure

  1. Perform quality control analyses on the sequencing data files
    We strongly recommend checking the quality of the ChIP-seq sequencing data before moving on to the next steps. FastQC is an excellent program we use which performs a series of quality control steps for a next generation sequencing (NGS) dataset. Some of the more important quality reports to take note of in the evaluation of sequencing quality include the ‘Per base sequence quality’, which should show consistently high quality values across the reads; the ‘Per base sequence content’, which should show non-random distribution of the nucleotides at each base; and ‘Sequence duplication levels’, which should not be excessive. The FastQC documentation in the web link above provides some examples of reports of good and bad sequencing data may look like.

    fastqc -o path_to_output_directory -t number_of_threads_to_usesequence_file. fastq

    Note: In addition to command line execution, FastQC can also be run as an interactive graphical application. Also, definitions for command line parameters (e.g., ‘-o’, ‘-t’) for FastQC and all other programs used in this protocol can be found in their respective weblinks as provided in the ‘Software’ section above.
  2. Alignment of the sequences to the genome
    1. For each sample, we performed the alignment of the raw fastq sequence reads to the reference mouse genome using the Bowtie2 program. Bowtie2 generates alignment results in the Sequence Alignment/Map (SAM) format (Li et al., 2009), which contains alignment information including the short read ID, chromosome name, start position, strand, mismatch information, and the raw nucleotides of the read, among others.

      bowtie2 -p number_of_compute_cores_to_use -x path_to_mm9_genome_index -1 sequence_file_forward_reads.fastq -2  sequence_file_reverse_reads.fastq -S alignment_file.sam

      Note: In cases of single-end sequencing alignments, remove ‘-2 sequence_file_reverse_reads.fastq’ from the above command.
    2. From the SAM file generated, we need to extract only the reads that mapped uniquely to the genome. As multi-mapped reads are represented by Bowtie2 in the SAM file only by way of an ‘XS:i:value’ tag:value pair, we will use the Linux ‘grep’ command to extract all result lines that do not include ‘XS:i:’ in them.

      grep -v “XS:i:” alignment_file.sam > alignment_unique.sam

    3. The SAM format files are text files which typically end up very large and unwieldy for the tens of millions of reads aligned per sample. A companion to SAM, the Binary Alignment/Map (BAM) format (Li et al., 2009), contains the exact same information as in SAM, but enables efficient compression and storing of the alignment data, and also allows for efficient retrieval of the aligned reads. The BAM format is now widely accepted by most programs processing alignment data. Using the samtools program, we shall convert and sort the SAM file into the BAM format for use in subsequent steps.

      samtools view  -Sb alignment_unique.sam > alignment_unique.bam

      samtools sort alignment_unique.bam alignment_sorted

  3. Removing PCR duplicates
    1. The presence of potentially redundant short reads, caused by PCR amplification, represents an intrinsic limitation of second generation sequencing technology. Because PCR preferentially amplifies DNA fragments with certain nucleotide compositions, it makes some genomic locations overrepresented in a ChIP-seq library. While it may be argued that this overrepresentation bias applies equally to treatment and control conditions and can therefore be normalized, it actually perturbs the distribution of the underlying data and may inflate statistical significance. Take a hypothetical example (Table 1), where the actual number of fragments in library A and B are 1 and 3, respectively. If we simply multiply this number by ten then we will have 10 fragments in library A and 30 in library B. Both cases give us the same fold change of 3.0 but dramatically different p-values if assessed by a Chi-square test.

      Table 1. A hypothetical example to demonstrate the danger of inflating statistical significance without removing redundancy. The P-values are generated from the Chi-square test.


    2. To remove the PCR duplicates, we utilize the fact that, during the preparation of a ChIP-seq library, a sonicator breaks the whole genome almost uniformly so that most fragments come from unique positions (because a mammalian genome is ~2-3 billion bp long). This process may remove ‘innocent’ fragments that truly come from the same position and strand. But this happens less commonly in comparison with PCR duplication, and, as a general rule in statistical inference, it is always better to be conservative than to make false claims. Therefore, we have made removal of PCR duplicates a part of our ChIP-seq processing pipeline (Loh et al., 2016).

      samtools rmdup -s alignment_sorted.bam alignment_rmdup.bam

      Note: In cases of paired-end alignments, use the -S option in place of the -s option.
      Special note: When micrococcal nuclease is used in place of sonication during fragmentation step in ChIP preparation, the redundancy removing procedure should not be applied. This is because the strong preference for micrococcal nuclease to cut a DNA sequence at particular positions (with specific nucleotide compositions) can lead to a lot of genuine duplicated fragments. We are not aware of any available approach to specifically remove PCR-based duplication in such a library.
  4. Perform cross-correlation quality analysis
    The analysis of strand cross-correlation can be used as a ChIP-seq quality metric (Kharchenko et al., 2008; Landt et al., 2012). The principle behind this analysis is that high-quality ChIP-seq experiments are expected to generate clusters of mapped reads on the forward and reverse strands, with the ChIP binding site centered between them. By increasingly shifting the reads in the direction of the strand they map to and then calculating the between-strand Pearson correlation of read depth at all positions, the plot of cross-correlation coefficient against strand-shift usually yields two peaks, one corresponding to the read length (the so called ‘phantom’ peak) and the other to the average fragment length of the library. The absolute and relative heights of these peaks are then used to calculate two quality metrics, the normalized strand coefficient (NSC) and the relative strand correlation (RSC). The Encode Consortium ChIP-seq guidelines state that high quality ChIP-seq datasets typically have NSC > 1.1 and RSC > 1, and recommend the generation of additional replicates when NSC < 1.05 and RSC < 0.8 (Landt et al., 2012). Here, we use the run_SPP_nodups.R script of the phantompeakqualtools package to calculate the NSC and RSC values. An example of the NSC and RSC metrics and cross-correlation plot generated by the script is shown in Figure 2.
     
    run_spp_nodups.R -savp -c=alignment_rmdup.bam -out=phantompeak_output.txt

    Notes:
    1. We recommend assessing the NSC and RSC values with some caution. They tend to be more informative for chromatin marks and transcription factors with punctuated peaks than broad peaks. If your sequencing target is a chromatin mark with broad pattern peaks, such as the H3K9me2, the NSC and RSC values may fall below the Encode suggested criteria. On the other hand, we have found some input control samples to have very high NSC and RSC values.
    2. The ‘phantom’ peak is a phenomenon caused by biased mappability of the genome. When a genomic region has good mappability, so does its reverse complement. Therefore, both strands of the genomic region will get an equal amount of aligned reads. Because the read depth is calculated based on the 5’ ends of the aligned reads, a strand-shift of the read length will yield a local maximum – the ‘phantom’ peak.


      Figure 2. Plot of cross-correlation coefficient as a function of strand-shift for an example H3K27ac ChIP-seq sample, generated by the run_SPP_nodups.R script. The blue dotted line shows the location of the phantom peak (i.e., read length) and the red dotted lines show the best and the second-best guesses (the two numbers in parentheses) of the fragment length. The fragment length estimates are identified as local maxima.

  5. Generation of coverage files to be loaded into a genome browser
    The effective visualization of a ChIP-seq sample provides valuable information to a bench biologist (Figure 3). Through the use of a genome browser, an investigator can easily see which genomic regions are enriched by ChIP. In Figure 3, the y-axis gives the number of short reads aligned to that location, which can also be normalized by library size for comparing two or more ChIP-seq samples. This can be done by generating a so-called coverage file for each ChIP-seq sample. Many options are available in choosing a genome browser. Most genome browsers support its own file format that best suits its implementation, but often also supports some other common formats. The UCSC genome browser (Kent et al., 2002) is one of the earliest genome browsers and probably still the most comprehensive one. It is a web-based application and usually requires a user to upload the coverage files to a remote server. As sequencing technology advances rapidly, so does the size of coverage files generated from ChIP-seq samples. Therefore, it has become very inconvenient to upload those large files to a remote server. We therefore recommend using a genome browser that can work on a local machine. A very good choice is the IGV genome browser which is a Java-based cross-platform application. It supports a binary format called TDF and provides a utility program (igvtools) for users to generate a TDF file from a BAM format alignment file or several other commonly used formats. 

    igvtools count alignment_rmdup.bam alignment_rmdup.tdf mm9


    Figure 3. Coverage plot of a histone mark, H3K9me3, after repeated cocaine or saline conditions shown in the IGV genome browser (Robinson et al., 2011). Y-axis represents the normalized coverage of DNA fragments enriched by the antibody against H3K9me3. This is a peak of ~9 Kb long located on chromosome 8, which shows reduced binding after chronic cocaine treatment in mouse nucleus accumbens (NAc) (Maze et al., 2011).

  6. Generation of global coverage plots
    A global coverage plot that shows the averaged coverage over all instances of annotated genomic features, such as transcription start sites (TSS), is often extremely helpful in determining enrichment, assessing IP efficiency, or correlating with gene groups. An example is shown in Figure 4, which demonstrates that the enrichment of H3K4me3 correlates positively with gene expression levels. The generation of such a plot is straightforward. The basic principle is first to calculate the genomic coverage vector for a ChIP-seq sample. One then retrieves the genomic coordinates for the selected genomic feature from an annotation database. With the genomic coordinates, the coverage values are then extracted and averaged. We have developed an open source software package called ngs.plot, based on the statistical package R, that would generate these global coverage plots. The command shown below executes a basic ngs.plot analysis with the four mandatory arguments required. Additional customizations of the analysis and plotting can be performed with additional arguments (https://github.com/shenlab-sinai/ngsplot/wiki/weblink). ngs.plot is also available as an add-on to Galaxy (Goecks et al., 2010), a popular web-based genomic analysis framework.

    ngs.plot.r -G mm9 -R tss -C alignment_rmdup.bam –O outputfile_prefix
     
    Note: All the above-mentioned steps 1-6 can be considered as pre-processing steps that would be needed to be applied to all samples. While the individual command lines to run each step have been provided here, we have also developed an open source automated pipeline (Loh et al., 2016) (Available at http://github.com/shenlab-sinai/chip-seq_preprocess/) that would automatically run all these steps on all samples. The pipeline also features parallel processing and automatic resume of interrupted jobs. Please refer to the URL above for instructions on how to install and use the pipeline, which would not be covered here. 


    Figure 4. Global TSS plot for a histone mark, H3K4me3, in mouse NAc under saline condition correlating with gene expression levels. H3K4me3 is known as a transcriptional activation mark and shows a strong positive correlation with gene expression levels. Here, four gene groups – High, Medium 1, Medium 2, and Low are defined by normalized RNA-seq read counts. The four gene groups are ranked by their expression levels from high to low and each group contains 1,000 genes randomly selected from the genome.

  7. Peak detection and annotation
    Determining binding enrichments for a ChIP-seq library with rigorous statistical evaluation (peak calling) is often extremely useful. Global statistics can be obtained on the peaks to determine their distribution (Figure 5). One can also select top ranked peaks for follow up, or correlate the peak list with functional annotations or gene expression. Peak calling for ChIP-seq has been a heavily researched area in the past few years, with dozens of software tools developed to date (for reviews see [Szalkowski and Schmid, 2010; Pepke et al., 2009; Wilbanks and Facciotti, 2010]). The basic principle of any peak calling program can be summarized as follows: 1) the whole genome is binned into small intervals of fixed size, such as 200 bp; 2) the number of short reads in each bin is used to build a null distribution to describe non-binding regions, typically using Poisson or Negative Binomial (NB) distributions; 3) a sliding window is used to scan the whole genome and identify regions that exceed pre-determined statistical cutoffs. As we examine this procedure, we can see that the most important part of peak calling is in the building of the null distribution that describes a ChIP-seq background. This is a non-trivial task because the short read counts of a ChIP-seq library do not follow a unified random process. The complexity is caused by many factors, but some of the most important ones are: uneven packaging of the chromatin into nucleosomes, PCR bias, and sequencing imperfection (Chen et al., 2012). Modeling these effects by DNA sequences alone is an extremely complicated task, if it is even possible. Therefore, it is always recommended to couple several IP libraries with an input library with the same preparation procedure. One can then easily compare the normalized read counts from the IP and input libraries and determine the enrichment of IP. A very popular peak calling program is MACS, which can be used with or without inputs. MACS utilizes a local Poisson distribution to model the background of a ChIP-seq sample. When working without input, it uses regional counts from IP as background. This is prone to inaccuracy because there is no way to separate the short reads coming from IP and non-specific binding. When an input is available, MACS compares IP with input to identify enrichment with better sensitivity and specificity than IP alone.

    macs2 callpeak -t alignment_rmdup.bam -c input.bam -f BAM -g mm -n outputfile_prefix --bw 150 -q 0.05


    Figure 5. Peak distribution of two histone marks, (A) H3K4me3 and (B) H3K9me3, across the whole genome using existing annotation of gene features. H3K4me3 is a ‘genic’ mark with most of its peaks located around TSSs or on gene bodies. In contrast, H3K9me3 is a ‘non-genic’ mark with most of its peaks located in intergenic regions (Maze et al., 2011).

  8. Differential analysis and functional association
    1. As we are interested in the characterization of drug-induced chromatin modifications, it is of high interest to compare the two IP libraries of treatment and control and identify differences in binding, or in other words, differential sites. This can be done by comparing the two peak lists from treatment and control and making presence/absence calls. However, we have found this approach to be ineffective for our data because the drug-induced chromatin changes in brain are often relatively small, presumably due to the heterogeneity of the tissue compared, for example, to cell culture systems. Typically, one may observe a peak to be present in both drug-treated and control conditions, but some of the peaks do show significant differences in magnitude (Figure 6A). In addition, there are situations where a peak is not significantly more enriched in one condition than in the comparison condition but still classified as different because an arbitrary cutoff is chosen (Figure 6B). A refinement of this approach is to identify peaks in both conditions and make quantitative assessment on their binding enrichments (Liang and Keles, 2012). This is a significant improvement to the first approach and can usually generate much more reliable differential sites than the former. The approach should work well for transcription factor-like histone marks. However, for some histone marks, this can still be suboptimal. For example, marks like H3K9me3 typically produce peaks as long as several Kb, while differences can be observed well within a peak (Figure 6C). Comparing two peaks can be ineffective in identifying this kind of changes. To deal with the above challenges, we have developed our own program – diffReps to detect differential chromatin modification regions between two conditions with high definition. diffReps employs a sliding window approach to identify all regions that show significant changes. With a tunable window size, it allows one to look for either large blocks or small regions of chromatin modifications. diffReps also takes into account the biological variations within each condition and provides a few statistical tests to choose from for assessing the differences. When biological replicates are available, we recommend the use of NB test which models the over-dispersion (large variation) of discrete counting data within a group. NB test is usually much more sensitive than t-test in detecting differential sites for ChIP-seq data. It also avoids the problem of making false positives on small count data. When there is only one sample in each group, one can choose either Pearson Chi-square or G-test for differential analysis. The following example shows a diffReps run to detect differential sites by comparing two experimental conditions where each condition contains three BED files. Input samples for each condition are also available and used. We use a window size of 200 bp to obtain a high-resolution profiling of the chromatin modification landscape and NB test to assess statistical significance.

      diffReps.pl --tr T1.bed T2.bed T3.bed --co C1.bed C2.bed C3.bed --btr T_input.bed --bco C_input.bed --re diffsite.txt --gname mm9 --me nb --wi 200 --st 20

      Note: diffReps requires input files to be in the BED format (https://genome.ucsc.edu/FAQ/FAQformat#format1), which can be converted from the BAM format using the bedtools package (Quinlan and Hall, 2010) using the command:

      bedtools bamtobed -i inputfile.bam > outputfile.bed


      Figure 6. Challenges in detecting differential sites from ChIP-seq data by comparing two conditions. A. Two peaks are both above significance cutoff but show differences in binding. B. One peak is above cutoff and the other peak is below cutoff but they are not significantly different. C. There is a region with significant difference within a peak but may not be detected when the whole peak is considered.

    2. The final step in our analysis is the functional annotation for a peak list or differential site list. This has traditionally been done by first annotating each genomic region to the closest gene and then testing the enrichment of functional terms among this Goecks gene group. To annotate each called peak or differential site according to the type of genomic feature on which they are located, the region_analysis program is used. The region_analysis program is able to annotate any text file where the first three columns are chromosome, start and end positions respectively.

      region_analysis.py -i input_file -g mm9 -d refseq –r

    3. Gene enrichment analysis of the gene lists annotated to peaks or differential sites can be accomplished with various open source or commercially available tools. A very popular open source tool is DAVID, while a commercially available alternative is Ingenuity Pathway Analysis (IPA). Additionally, another development in functional interpretation of peaks is to overlap them with gene regulatory regions which can either be proximal regions at TSS or distal regions as far as 1 Mb away. This approach is useful because, with the advent of ChIP-seq, we are now able to profile genomic regions that are not restricted to promoters. Potential genomic regions that may contain regulatory information include distant enhancers as well as gene bodies. This functionality greatly enhances our ability to reveal the previously unknown functions of a histone mark or a DNA binding protein. A representative tool in this category is called ‘Genomic Regions Enrichment of Annotations Tool’ (GREAT). GREAT simply requires the input of a BED file which can be easily obtained from peak calling or diffReps outputs. These analysis tools are mainly web-based and interested readers are referred to their websites for more information.
  9. Anticipated results
    The ChIP-seq protocol outlined here is an extremely powerful and comprehensive technique for characterizing chromatin modification landscapes for histone marks in brain. The ChIP-seq data not only allows us to look at traditional regulatory regions – promoters, but also enables us to investigate remote regulatory sites, such as enhancers and exons. With this enhanced capability, we have now entered a new era to study disease-related chromatin modification changes with unprecedented detail.

Data analysis

For some concrete examples of using the above analytic procedures on a dataset of seven histone modification marks and RNA-seq to study cocaine-induced epigenomic and transcriptomic changes in mouse brain, readers are referred to our previous study (Feng et al., 2014).

Acknowledgments

This work is supported by grants: P50MH096890 and P01DA008227.

References

  1. Chen, Y., Negre, N., Li, Q., Mieczkowska, J. O., Slattery, M., Liu, T., Zhang, Y., Kim, T. K., He, H. H., Zieba, J., Ruan, Y., Bickel, P. J., Myers, R. M., Wold, B. J., White, K. P., Lieb, J. D. and Liu, X. S. (2012). Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Methods 9(6): 609-614.
  2. Dennis, G., Jr., Sherman, B. T., Hosack, D. A., Yang, J., Gao, W., Lane, H. C. and Lempicki, R. A. (2003). DAVID: database for annotation, visualization, and integrated discovery. Genome Biol 4(5): P3.
  3. Feng, J., Wilkinson, M., Liu, X., Purushothaman, I., Ferguson, D., Vialou, V., Maze, I., Shao, N., Kennedy, P., Koo, J., Dias, C., Laitman, B., Stockman, V., LaPlant, Q., Cahill, M.E., Nestler, E.J., and Shen, L. (2014) Chronic cocaine-regulated epigenomic changes in mouse nucleus accumbens. Genome Biol 15(4):R65.
  4. Goecks, J., Nekrutenko, A., Taylor, J. and Galaxy, T. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8): R86.
  5. Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M. and Haussler, D. (2002). The human genome browser at UCSC. Genome Res 12(6): 996-1006.
  6. Kharchenko, P. V., Tolstorukov, M. Y. and Park, P. J. (2008). Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 26(12): 1351-1359.
  7. Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B. E., Bickel, P., Brown, J. B., Cayting, P., Chen, Y., DeSalvo, G., Epstein, C., Fisher-Aylor, K. I., Euskirchen, G., Gerstein, M., Gertz, J., Hartemink, A. J., Hoffman, M. M., Iyer, V. R., Jung, Y. L., Karmakar, S., Kellis, M., Kharchenko, P. V., Li, Q., Liu, T., Liu, X. S., Ma, L., Milosavljevic, A., Myers, R. M., Park, P. J., Pazin, M. J., Perry, M. D., Raha, D., Reddy, T. E., Rozowsky, J., Shoresh, N., Sidow, A., Slattery, M., Stamatoyannopoulos, J. A., Tolstorukov, M. Y., White, K. P., Xi, S., Farnham, P. J., Lieb, J. D., Wold, B. J. and Snyder, M. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22(9): 1813-1831.
  8. Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4): 357-359.
  9. Lee, T. I., Jenner, R. G., Boyer, L. A., Guenther, M. G., Levine, S. S., Kumar, R. M., Chevalier, B., Johnstone, S. E., Cole, M. F., Isono, K., Koseki, H., Fuchikami, T., Abe, K., Murray, H. L., Zucker, J. P., Yuan, B., Bell, G. W., Herbolsheimer, E., Hannett, N. M., Sun, K., Odom, D. T., Otte, A. P., Volkert, T. L., Bartel, D. P., Melton, D. A., Gifford, D. K., Jaenisch, R. and Young, R. A. (2006). Control of developmental regulators by Polycomb in human embryonic stem cells. Cell 125(2): 301-313.
  10. Liang, K. and Keles, S. (2012). Detecting differential binding of transcription factors with ChIP-seq. Bioinformatics 28(1): 121-122.
  11. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing Subgroup. (2009). The sequence alignment/map format and SAMtools. Bioinformatics 25(16): 2078-2079.
  12. Loh, Y. H., Shao, N. Y. and Shen, L. (2016). ChIPseqRUs: a pipeline for ChIP-seq preprocessing. Github repository. Zenodo.
  13. Maze, I., Feng, J., Wilkinson, M. B., Sun, H., Shen, L. and Nestler, E. J. (2011). Cocaine dynamically regulates heterochromatin and repetitive element unsilencing in nucleus accumbens. Proc Natl Acad Sci U S A 108(7): 3035-3040.
  14. McLean, C. Y., Bristor, D., Hiller, M., Clarke, S. L., Schaar, B. T., Lowe, C. B., Wenger, A. M. and Bejerano, G. (2010). GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 28(5): 495-501.
  15. Park, P. J. (2009). ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 10(10): 669-680.
  16. Pepke, S., Wold, B. and Mortazavi, A. (2009). Computation for ChIP-seq and RNA-seq studies. Nat Methods 6(11 Suppl): S22-32.
  17. Quinlan, A. R. and Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6): 841-842.
  18. Renthal, W., Kumar, A., Xiao, G., Wilkinson, M., Covington, H. E., 3rd, Maze, I., Sikder, D., Robison, A. J., LaPlant, Q., Dietz, D. M., Russo, S. J., Vialou, V., Chakravarty, S., Kodadek, T. J., Stack, A., Kabbaj, M. and Nestler, E. J. (2009). Genome-wide analysis of chromatin regulation by cocaine reveals a role for sirtuins. Neuron 62(3): 335-348.
  19. Robison, A. J. and Nestler, E. J. (2011). Transcriptional and epigenetic mechanisms of addiction. Nat Rev Neurosci 12(11): 623-637.
  20. Robinson, J. T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G. and Mesirov, J. P. (2011). Integrative genomics viewer. Nat Biotechnol 29(1): 24-26.
  21. Shao, N. Y., Loh, Y. H. and Shen, L. (2016). Region-analysis: a python program for genomic interval annotations. Github repository. Zenodo.
  22. Shen, L., Shao, N. Y., Liu, X., Maze, I., Feng, J. and Nestler, E. J. (2013). DiffReps: detecting differential chromatin modification sites from ChIP-seq data with biological replicates. PLoS One 8(6): e65598.
  23. Shen, L., Shao, N., Liu, X. and Nestler, E. (2014). ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC Genomics 15: 284.
  24. Szalkowski, A. M. and Schmid, C. D. (2011). Rapid innovation in ChIP-seq peak-calling algorithms is outdistancing benchmarking efforts. Brief Bioinform 12(6): 626-633.
  25. Wilbanks, E. G. and Facciotti, M. T. (2010). Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One 5(7): e11471.
  26. Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., Nusbaum, C., Myers, R. M., Brown, M., Li, W. and Liu, X. S. (2008). Model-based analysis of ChIP-Seq (MACS). Genome Biol 9(9): R137.

简介

染色质免疫沉淀,然后进行大规模并行测序(ChIP-seq)是一个强大的技术,用于描绘全基因组染色质修饰模式,并且越来越多地用于研究脑疾病如药物成瘾的分子机制。该协议讨论了使用慢性可卡因治疗研究作为模板的ChIP-seq数据生成,生物信息分析和结果解释中涉及的典型程序。我们描述了一种在小鼠脑中诱导显着染色质修饰的实验设计,并且使用ChIP-seq获得关于所涉及的染色质调节机制的新颖信息。我们描述用于预处理测序数据的生物信息学方法,产生特定组蛋白修饰的全局富集曲线,鉴定富集的基因组位点,找到差异修饰位点,并进行功能分析。这些ChIP-seq分析为慢性暴露于可卡因引起的大脑染色质变化提供了许多细节,并产生了无价的信息来源,以了解吸毒成瘾的分子机制。我们的协议提供了一个标准化的数据分析程序,可以作为任何其他ChIP-seq项目的起点。

背景 染色质修饰已经涉及药物成瘾的分子机制,并且可能成为理解成瘾行为的多个方面的关键(Robison and Nestler,2011)。染色质免疫沉淀(ChIP),随后是大规模并行测序(ChIP-seq)是目前现有技术,用于描绘染色质景观。 ChIP-seq的典型程序包括:1)使用针对目标蛋白的抗体将已经固定在蛋白质上的结合DNA拉碎成较小的片段,然后将免疫沉淀的DNA纯化并构建成用于从插入DNA片段的末端进行短读取(通常为50-100bp)的高通量测序库,3)将短读数与基因组比对并进行数据分析。与其前身ChIP芯片相比,ChIP-seq具有无与伦比的优势,如全基因组的无偏见,单碱基分辨率和显着提高的信噪比(Park,2009)。已被证明是理解许多类型的染色质修饰的宝贵工具。
&nbsp; &nbsp; ChIP-seq实验的简要概述:请参阅原始研究文章以获得更多的实验细节(Renthal等人,2009; Lee等人,2006)。简而言之,成年小鼠接受重复可卡因(7次IP注射可卡因[20mg / kg]或盐水)的标准方案,并在最后一次注射后24小时使用(Robison and Nestler,2011)。通过打孔解剖获得伏隔核,一个主要的脑回报区域,用于ChIP-seq。染色质IP如前所述进行(Renthal等人,2009; Lee等人,2006; http://jura.wi.mit.edu/young_public /hES_PRC/ChIP.html ),稍作修改,使用两种抗体, H3K4me3(组蛋白H4中的Lys4的三甲基化)(Abcam#ab8580)和抗H3K9me3(Abcam#ab8898)。每个实验条件的测序库一式三份生成,然后在Illumina测序仪上测序。
&nbsp; &nbsp;生物信息学分析:在处理和相应对照条件下从文库产生的测序数据将用于鉴定已经经历染色质变化的特定基因组区域。然后,我们可以将这些区域与生物功能相关联,并选择感兴趣的区域进行进一步研究。这种生物信息学分析的基本步骤可以作为八个步骤的流水线布置(图1):
1.对排序数据文件执行质量控制分析;
2.将序列与基因组对齐;
3.删除PCR重复;
4.进行互相关质量分析;
5.生成要加载到基因组浏览器中的覆盖文件;
6.生成全球覆盖图;
7.执行峰值检测和注释;
8.进行差异分析和功能关联。
&nbsp; &nbsp;该协议将更详细地解释这些步骤,包括生物信息学分析基础的一些原则和理由,并提供在类Unix命令行环境中执行的必要命令。


图1. ChIP-seq分析流水线的流程图

关键字:染色质免疫沉淀(ChIP), 下一代测序(NGS), ChIP-seq, 可卡因, 生物信息学, 表观遗传学, 组蛋白修饰

设备

  1. 个人计算机或高性能计算集群。这里提到的所有软件都可以在类似Unix的工作站(如Linux和Mac)下运行。对于Windows用户,终端模拟器(如Cygwin)( http://cygwin.com/),可以使用。

软件

  1. FastQC( http://www.bioinformatics.babraham.ac .uk/projects/fastqc/
  2. Bowtie2(Langmead和Salzberg,2012; http://bowtie- bio.sourceforge.net/bowtie2/index.shtml
  3. Samtools(Li&et al。,et al。,2009; http://samtools .sourceforge.net/
  4. 幻影工具( http://code.google.com/archive/p/phantompeakqualtools/
  5. IGV和IGVtools(Robinson等人,2011; http://software.broadinstitute.org/software/igv/
  6. ngs.plot(Shen et al。,2014; http://github.com/shenlab-sinai/ngsplot
  7. MACS(Zhang等人,2008; http://github.com/taoliu/MACS
  8. diffReps(Shen等),2013; http: //github.com/shenlab-sinai/diffreps
  9. 床上用品(Quinlan and Hall,2010; http://bedtools.readthedocs.io/en/latest/
  10. 区域分析(Shao等人,,2016; http://github.com/shenlab-sinai/region_analysis
  11. ChIPseqRUs(Loh et al。,2016; https://github.com/shenlab-sinai/chip-seq_preprocess
  12. David(Dennis et al。,2003; http://david .ncifcrf.gov/
  13. IPA( http://www.ingenuity.com/
  14. GREAT(McLean等人,,2010; http://bejerano.stanford.edu/great/public/html/

程序

  1. 对排序数据文件执行质量控制分析
    我们强烈建议您检查ChIP-seq排序数据的质量,然后再转到下一步。 FastQC是我们使用的一个优秀的程序,它为下一代排序(NGS)数据集执行一系列质量控制步骤。在评估测序质量方面需要注意的一些更重要的质量报告包括"每碱基序列质量",它们应在整个阅读中显示出一贯的高品质价值; "每碱基序列含量",应显示每个碱基上核苷酸的非随机分布;和"序列重复级别",这不应该过多。上面的Web链接中的FastQC文档提供了一些好的和坏的排序数据可能看起来的报告的例子。

    fastqc -o path_to_output_directory -t number_of_threads_to_usesequence_file。 fastq

    注意:除了执行命令行,FastQC也可以作为交互式图形应用程序运行。此外,FastQC和本协议中使用的所有其他程序的命令行参数(例如,"-o","-t")的定义可以在上述"软件"部分提供的各自的网页链接中找到。 >
  2. 序列与基因组的比对
    1. 对于每个样品,我们使用Bowtie2程序进行原始fastq序列读取与参考小鼠基因组的比对。 Bowtie2在序列比对/映射(SAM)格式(Li et al。,2009)中生成对齐结果,其中包含对齐信息,包括短读ID,染色体名称,起始位置,链,失配信息,以及阅读的原始核苷酸等
      bowtie2 -p number_of_compute_cores_to_use -x -1 sequence_file_reverse_reads.fastq -S alignment_file.sam

      注意:在单端排序比对的情况下,从上述命令中删除'-2 sequence_file_reverse_reads.fastq'。
    2. 从生成的SAM文件中,我们只需要提取唯一映射到基因组的读取。由于多映射读取由SAM文件中的Bowtie2仅通过"XS:i:value"标签:值对表示,我们将使用linux'grep'命令来提取不包含"XS"的所有结果行:我们在他们里面
      grep -v"XS:i:" align_file.sam > align_unique.sam

    3. SAM格式文件是文本文件,其通常最终对于每个样本对齐的数千万次读取来说非常大且笨重。 SAM,二进制对齐/映射(BAM)格式(Li et al。,2009)的伴侣包含与SAM完全相同的信息,但能够有效地压缩和存储对齐数据,并且还允许有效地检索对准的读取。 BAM格式现在被大多数程序处理对齐数据广泛接受。使用samtools程序,我们将把SAM文件转换并排序成BAM格式,以便后续步骤使用。

      samtools view  -Sb alignment_unique.sam > align_unique.bam

      samtools sort align_unique.bam alignment_sorted

  3. 删除PCR重复项
    1. 由PCR扩增引起的潜在冗余短读的存在代表了第二代测序技术的内在限制。因为PCR优先放大具有某些核苷酸组成的DNA片段,所以使一些基因组位置在ChIP-seq文库中超过表达。虽然可以认为这种超额代表偏见同样适用于治疗和控制条件,因此可以归一化,但实际上扰乱了潜在数据的分布,并可能增加统计学意义。假设一个例子(表1),其中文库A和B中实际的片段数分别为1和3。如果我们简单地将这个数乘以10,那么我们将在库A中有10个片段,在库B中将有10个片段。如果通过卡方检验进行评估,这两个病例给出了3.0的相同的倍数变化,但是显着不同的p值。 />
      表1.一个假设的例子,以证明在不消除冗余的情况下膨胀统计学意义的危险。 从卡方检验生成 P 值。


    2. 为了去除PCR重复,我们利用这样的事实,即在制备ChIP-seq文库期间,超声波仪几乎均匀地打破整个基因组,使得大多数片段来自独特的位置(因为哺乳动物基因组约2-3亿bp长)。这个过程可能会消除真正来自同一位置和股份的"无辜"碎片。但是,与PCR重复相比,这种情况较少发生,而作为统计学推断的一般规律,总是比做出虚假声明更保守。因此,我们已经删除了PCR重复的一部分ChIP-seq处理流水线(Loh等人,2016)。

      samtools rmdup -s <对齐__sm.bam align_rmdup.bam

      注意:在配对结束对齐的情况下,请使用-S选项代替-s选项。
      特别注意:在ChIP制备中的碎裂步骤中使用微球菌核酸酶代替超声处理时,不应使用冗余去除程序。这是因为对于微球菌核酸酶在特定位置(具有特定核苷酸组成)切割DNA序列的强烈偏好可能导致许多真正重复的片段。我们不知道在这样一个库中特别删除基于PCR的重复的任何可用的方法。
  4. 执行互相关质量分析
    链互相关的分析可以用作ChIP-seq质量度量(Kharchenko等人,2008; Landt等人,2012)。这种分析的原理是,高质量的ChIP-seq实验预计将产生前向和反向链上的映射读取集群,其中ChIP绑定位点位于它们之间。通过越来越多地将读取方向转移到它们映射到的线的方向上,然后计算所有位置上的读取深度的两股之间的Pearson相关性,互相关系数对股线移位的图形通常产生两个峰值,一个对应于读取长度(所谓的"幻影"峰值),另一个到图书馆的平均片段长度。然后将这些峰的绝对和相对高度用于计算两个质量度量,标准化链系数(NSC)和相对链相关性(RSC)。编码联盟ChIP-seq指南指出,高质量的ChIP-seq数据集通常具有NSC> 1.1和RSC> 1,并且推荐在NSC < 1.05和RSC < 0.8(Landt等人,2012)。在这里,我们使用phantompeakqualtools包的run_SPP_nodups.R脚本来计算NSC和RSC值。图2中显示了由脚本生成的NSC和RSC度量和互相关图的示例  
    > run_spp_nodups.R -savp -c = align_rmdup.bam -out = phantompeak_output.txt br />
    注意:
    1. 我们建议谨慎地评估NSC和RSC值。它们对于染色质标记和转录因子往往更具信息,其峰值峰值高于宽峰。如果您的测序目标是具有广泛模式峰的染色质标记,例如H3K9me2,则NSC和RSC值可能会低于Encode建议标准。另一方面,我们发现一些输入控制样本具有非常高的NSC和RSC值。
    2. "幻影"峰值是由基因组的偏置映射引起的现象。当基因组区域具有良好的映射性时,其反向互补也是如此。因此,基因组区域的两条链将获得相等量的对齐读数。因为读取深度是根据对齐读数的5'端计算出来的,读取长度的变换将产生局部最大值"幻影"峰值。



      图2.由run_SPP_nodups.R脚本生成的示例H3K27ac ChIP-seq示例的互相关系数作为链变换的函数。 蓝色虚线显示幻影峰( ie ,读取长度)的位置,红色虚线显示最佳和第二好的猜测(括号中的两个数字)片段长度。片段长度估计被确定为局部最大值。

  5. 生成要载入基因组浏览器的覆盖文件
    ChIP-seq样品的有效可视化为台式生物学家提供了有价值的信息(图3)。通过使用基因组浏览器,研究者可以很容易地看到ChIP富集了哪些基因组区域。在图3中,y轴给出与该位置对准的短读数,其也可以通过用于比较两个或更多个ChIP-seq样本的库大小进行归一化。这可以通过为每个ChIP-seq样本生成所谓的覆盖文件来完成。选择基因组浏览器有很多选择。大多数基因组浏览器支持最适合其实现的自己的文件格式,但通常还支持一些其他常见格式。 UCSC基因组浏览器(Kent等人,2002)是最早的基因组浏览器之一,可能仍然是最全面的基因组浏览器之一。它是一个基于Web的应用程序,通常需要用户将覆盖文件上传到远程服务器。随着测序技术的迅速发展,ChIP-seq样本产生的覆盖文件的大小也将随之而来。因此,将这些大文件上传到远程服务器变得非常不方便。因此,我们建议使用可在本地机器上工作的基因组浏览器。一个非常好的选择是基于Java的跨平台应用程序的IGV基因组浏览器。它支持称为TDF的二进制格式,并为用户提供了一个实用程序(igvtools),用于从BAM格式对齐文件或其他常用格式生成TDF文件。

    igvtools count align_rmdup.bam alignment_rmdup.tdf mm9


    图3.在IGV基因组浏览器中显示的重复可卡因或盐水条件后组蛋白标记H3K9me3的覆盖图(Robinson等人,2011)。 Y轴表示由针对H3K9me3的抗体富集的DNA片段的归一化覆盖。这是位于染色体8上的〜9Kb长的峰,其显示了在小鼠伏隔核(NAc)中的慢性可卡因治疗后的结合减少(Maze等人,2011)。

  6. 生成全球覆盖图
    显示所有注释基因组特征实例(如转录起始位点(TSS))的平均覆盖率的全局覆盖率通常对于确定浓缩,评估IP效率或与基因组相关性非常有帮助。图4显示了一个例子,说明H3K4me3的富集与基因表达水平呈正相关。这种情节的产生是直接的。基本原理是首先计算ChIP-seq样品的基因组覆盖载体。然后,从注释数据库中检索所选基因组特征的基因组坐标。使用基因组坐标,然后提取并平均覆盖值。我们已经开发了一个名为ngs.plot的开源软件包,基于统计包R,可以生成这些全局覆盖图。下面显示的命令执行基本的ngs.plot分析,其中包含四个必需参数。分析和绘图的其他自定义可以使用其他参数( https ://github.com/shenlab-sinai/ngsplot/wiki weblink )。 ngs.plot也可作为Galaxy(Goecks等人,2010)的附加组件,这是一种流行的基于Web的基因组分析框架。

    ngs.plot.r -G mm9 -R tss -C alignment_rmdup.bam -O outputfile_prefix
     
    注意:所有上述步骤1-6可以被认为是需要应用于所有样品的预处理步骤。虽然这里提供了每个步骤的单独命令行,我们还开发了一个开源自动流水线(Loh et al。,2016)(Available at http://github.com/shenlab-sinai/chip-seq_preprocess/),将自动对所有样本运行所有这些步骤。管道还具有并行处理和自动恢复中断的作业。请参阅上面的URL以了解如何安装和使用管道的说明,这里不再赘述。 


    图4.组织蛋白标记H3K4me3的全球TSS图,小鼠NAc在盐水条件下与基因表达水平相关。 H3K4me3被称为转录激活标记,与基因表达水平呈强正相关。在这里,通过标准化的RNA-seq读数计数定义了4个基因组 - 高,中1,中2和低。四个基因组的表达水平从高到低排列,每组包含从基因组中随机选择的1,000个基因
  7. 峰值检测和注释
    通过严格的统计评估(峰值调用)确定ChIP-seq文库的结合浓缩通常是非常有用的。可以在峰上获得全球统计数据,以确定其分布(图5)。还可以选择排名最高的峰值进行跟踪,或将峰值列表与功能注释或基因表达相关联。峰值呼叫ChIP-seq在过去几年中一直是一个研究重点的领域,迄今为止开发了数十种软件工具(参见[Szalkowski和Schmid,2010; Pepke等人,2009年) ; Wilbanks and Facciotti,2010])。任何峰值调用程序的基本原理可概括如下:1)将全基因组分为固定大小的小间隔,如200 bp; 2)每个仓中的短读数用于构建零分布以描述非绑定区域,通常使用泊松或负二项(NB)分布; 3)使用滑动窗口扫描整个基因组并识别超过预定的统计截止值的区域。当我们研究这个过程时,我们可以看到,峰值调用的最重要部分是构建描述ChIP-seq背景的零分布。这是一个不平凡的任务,因为ChIP-seq库的简短读取计数不遵循统一的随机过程。复杂性是由许多因素引起的,但一些最重要的因素是:染色质不均匀包装在核小体中,PCR偏差和测序缺陷(Chen et al。,2012)。通过DNA序列单独建模这些效果是一个非常复杂的任务,如果它是可能的话。因此,始终建议使用相同的准备过程将多个IP库与输入库耦合。然后可以轻松地比较来自IP和输入库的归一化读取计数,并确定IP的浓缩。非常受欢迎的峰值调用程序是MACS,可以使用或不使用输入。 MACS利用本地泊松分布来模拟ChIP-seq样本的背景。在没有输入的情况下工作时,它使用IP的区域计数作为背景。这很容易导致不准确,因为没有办法分离来自IP和非特定绑定的短读。当输入可用时,MACS将IP与输入进行比较,从而识别浓度比单独的IP更好的灵敏度和特异性。

    macs2 callpeak -t align_rmdup.bam -c input.bam -f BAM -g mm -n outputfile_prefix - bw 150 -q 0.05


    图5.使用现有的基因特征注释,两个组蛋白标记(A)H3K4me3和(B)H3K9me3)在整个基因组上的峰值分布。H3K4me3是一个"基因"标记,其大多数峰位于TSS周围或基因体上。相比之下,H3K9me3是一个"非基因"标记,其大部分峰位于基因间区域(Maze et al。,2011)。

  8. 差分分析与功能关联
    1. 由于我们对药物诱导的染色质修饰的表征感兴趣,因此比较两种IP库的治疗和控制是非常有意义的,并且识别结合的差异,或者换句话说,差异位点。这可以通过比较来自治疗和控制的两个峰值列表以及存在/不存在呼叫来完成。然而,我们发现这种方法对我们的数据无效,因为药物诱导的大脑中的染色质变化通常相对较小,这可能是由于组织的异质性,例如与细胞培养系统相比。通常,人们可以观察到在药物治疗和对照条件下存在的峰,但是一些峰在幅度上显示出显着的差异(图6A)。另外,存在在一个条件下峰值在比较条件下不显着更丰富的情况,但是由于选择任意的截止值而仍被分类为不同(图6B)。这种方法的改进是确定两种条件下的峰值,并对其结合浓缩进行定量评估(Liang and Keles,2012)。这是第一种方法的显着改进,通常可以产生比前者更可靠的差异性位点。该方法应该适用于转录因子样组蛋白标记。然而,对于某些组蛋白标记,这仍然可能不是最佳的。例如,像H3K9me3这样的标记通常产生高达几个Kb的峰,而峰可以很好地观察到差异(图6C)。比较两个峰值可能无法识别这种变化。为了应对上述挑战,我们开发了我们自己的程序diffReps来检测两个高清晰度条件之间的差异染色质修饰区域。 diffReps采用滑动窗口方法来识别出显着变化的所有区域。使用可调整的窗口大小,可以查找大块或染色质修饰的小区域。 diffReps还考虑了每个条件下的生物学变化,并提供了一些统计测试来选择用于评估差异。当生物复制可用时,我们建议使用NB测试,对一组中离散计数数据的过度分散(大变化)进行建模。在检测ChIP-seq数据的差异站点时,NB测试通常比检测更敏感。它也避免了对小计数数据造成误报的问题。每组中只有一个样本,可以选择Pearson卡方或G检验进行差异分析。以下示例显示了通过比较两个实验条件来检测差异位点的diffReps,其中每个条件包含三个BED文件。每个条件的输入样品也可以使用。我们使用200 bp的窗口大小获得染色质修饰景观和NB测试的高分辨率分析,以评估统计学意义。

      diffReps.pl --tr T1.bed T2.bed T3.bed - co C1.bed C2.bed C3。 - btr T_input.bed - bco C_input.bed - re diffsite.txt - gname mm9 - 我nb --wi 200 --st 20

      注意:diffReps需要输入文件为BED格式( https://genome.ucsc.edu/FAQ/FAQformat#format1 ),可以使用bedtools包(Quinlan和Hall,2010)从BAM格式转换,使用命令:

      bedtools bamtobed -i inputfile.bam > outputfile.bed


      图6.通过比较两个条件来检测来自ChIP-seq数据的差异位点的挑战。 A.两个峰均高于显着性截止值,但显示结合差异。 B.一个峰值高于截止值,另一个峰值低于截止值,但它们没有显着差异。 C.有一个峰值有显着差异的区域,但考虑整个峰值时可能无法检测到。

    2. 我们分析的最后一步是峰值列表或差异站点列表的功能注释。这传统上是通过首先将每个基因组区域注释到最接近的基因,然后测试该Goecks基因组中功能术语的富集。要根据它们所在的基因组特征的类型来注释每个被称为峰或差异位点,则使用region_analysis程序。 region_analysis程序能够注释任何文本文件,其中前三列分别是染色体,开始和结束位置。

      region_analysis.py -i input_file -g mm9 -d refseq -r

    3. 可以用各种开源或市售工具完成注释到峰或差异位点的基因列表的基因富集分析。一个非常受欢迎的开放源码工具是DAVID,而商业上可用的替代方案是Ingenuity Pathway Analysis(IPA)。另外,峰的功能解释中的另一个发展是与基因调控区重叠,所述基因调控区可以是TSS的近端区域或远至1Mb的远端区域。这种方法是有用的,因为随着ChIP-seq的出现,我们现在能够描述不限于启动子的基因组区域。可能含有调控信息的潜在基因组区域包括远程增强子以及基因体。此功能大大增强了我们揭示组蛋白标记或DNA结合蛋白的以前未知功能的能力。这个类别中的代表性工具被称为"基因组区域丰富的注释工具"(GREAT)。 GREAT只需要输入一个可以从峰值调用或差分输出输出轻松获得的BED文件。这些分析工具主要是基于网络的,感兴趣的读者可以参考他们的网站了解更多信息。
  9. 预期成果
    这里概述的ChIP-seq协议是一种非常强大和全面的技术,用于表征大脑中组蛋白标记的染色质修饰景观。 ChIP-seq数据不仅允许我们查看传统的监管区域 - 启动子,还可以调查远程监管站点,如增强子和外显子。随着这种增强的能力,我们现在进入了一个新的时代,以前所未有的细节研究与疾病相关的染色质修饰变化。

数据分析

对于使用上述分析程序对七个组蛋白修饰标记和RNA-seq的数据集研究可卡因诱导的表观基因组和小鼠脑转录组学变化的一些具体实例,读者参考我们以前的研究(Feng等人)。 ,2014)。

致谢

这项工作得到拨款支持:P50MH096890和P01DA008227。

参考文献

  1. Chen,Y.,Negre,N.,Li,Q.,Mieczkowska,JO,Slattery,M.,Liu,T.,Zhang,Y.,Kim,TK,He,HH,Zieba,J.,Ruan,Y 。,Bickel,PJ,Myers,RM,Wold,BJ,White,KP,Lieb,JD and Liu,XS(2012)。  系统评估影响ChIP-seq保真度的因素。 Nat方法 9(6):609-614。 br />
  2. Dennis,G.,Jr.,Sherman,BT,Hosack,DA,Yang,J.,Gao,W.,Lane,HC and Lempicki,RA(2003)。< a class ="ke-insertfile"href = "http://www.ncbi.nlm.nih.gov/pubmed/12734009"target ="_ blank"> DAVID:用于注释,可视化和集成发现的数据库。 Genome Biol 4(5):P3。
  3. Feng,J.,Wilkinson,M.,Liu,X.,Purushothaman,I.,Ferguson,D.,Vialou,V.,Maze,I.,Shao,N.,Kennedy,P.,Koo, Dias,C.,Laitman,B.,Stockman,V.,LaPlant,Q.,Cahill,ME,Nestler,EJ,and Shen,L。(2014)  小鼠伏隔核细胞中的慢性可卡因调控的表观基因组变化 Genome Biol 15( 4):R65。
  4. Goecks,J.,Nekrutenko,A.,Taylor,J.和Galaxy,T。(2010)。< a class ="ke-insertfile"href ="http://www.ncbi.nlm.nih.gov/pubmed/20738864"target ="_ blank"> Galaxy:一种全面的方法,用于支持生命科学中可访问,可重复和透明的计算研究。 Genome Biol 11(8):R86 。
  5. Kent,WJ,Sugnet,CW,Furey,TS,Roskin,KM,Pringle,TH,Zahler,AM和Haussler,D。(2002)。 UCSC的人类基因组浏览器 Genome Res 12(6):996-1006。 br />
  6. Kharchenko,PV,Tolstorukov,MY and Park,PJ(2008)。  DNA结合蛋白的ChIP-seq实验的设计和分析。生物技术26(12):1351-1359。
  7. Landt,SG,Marinov,GK,Kundaje,A.,Kheradpour,P.,Pauli,F.,Batzoglou,S.,Bernstein,BE,Bickel,P.,Brown,JB,Cayting,P.,Chen,Y. ,DeSalvo,G.,Epstein,C.,Fisher-Aylor,KI,Euskirchen,G.,Gerstein,M.,Gertz,J.,Hartemink,AJ,Hoffman,MM,Iyer,VR,Jung,YL,Karmakar, S.,Kellis,M.,Kharchenko,PV,Li,Q.,Liu,T.,Liu,XS,Ma,L.,Milosavljevic,A.,Myers,RM,Park,PJ,Pazin,MJ,Perry, MD,Raha,D.,Reddy,TE,Rozowsky,J.,Shoresh,N.,Sidow,A.,Slattery,M.,Stamatoyannopoulos,JA,Tolstorukov,MY,White,KP,Xi,S.,Farnham, PJ,Lieb,JD,Wold,BJ和Snyder,M。(2012)。< a class ="ke-insertfile"href ="http://www.ncbi.nlm.nih.gov/pubmed/22955991"目标="_ blank"> ENCODE和modENCODE联盟的ChIP-seq指南和做法。 Genome Res 22(9):1813-1831。
  8. Langmead,B。和Salzberg,SL(2012)。与Bowtie 2快速的间隙读取对齐。 Nat方法 9(4):357-359。
  9. Lee,TI,Jenner,RG,Boyer,LA,Guenther,MG,Levine,SS,Kumar,RM,Chevalier,B.,Johnstone,SE,Cole,MF,Isono,K.,Koseki,H.,Fuchikami,T ,Abe,K.,Murray,HL,Zucker,JP,Yuan,B.,Bell,GW,Herbolsheimer,E.,Hannett,NM,Sun,K.,Odom,DT,Otte,AP,Volkert, Bartel,DP,Melton,DA,Gifford,DK,Jaenisch,R.and Young,RA(2006)。< a class ="ke-insertfile"href ="http://www.ncbi.nlm.nih。 goop/pubmed/16630818"target ="_ blank"> Polycomb在人类胚胎干细胞中的发育调节因子的控制细胞 125(2):301-313。
  10. Liang,K. and Keles,S。(2012)。
  11. Li,H.,Handsaker,B.,Wysoker,A.,Fennell,T.,Ruan,J.,Homer,N.,Marth,G.,Abecasis,G.,Durbin,R.and 1000 Genome Project Data Processing子组。 (2009)。序列对齐/地图格式和SAMtools 。生物信息学 25(16):2078-2079。
  12. Loh,YH,Shao,NY and Shen,L.(2016)。  ChIPseqRUs:用于ChIP-seq预处理的流水线。 github repository 。
  13. Maze,I.,Feng,J.,Wilkinson,MB,Sun,H.,Shen,L.and Nestler,EJ(2011)。< a class ="ke-insertfile"href ="http://www .ncbi.nlm.nih.gov/pubmed/21300862"target ="_ blank">可卡因可动态调节异染色质和重复元件在核伏核中的无效。 Proc Natl Acad Sci USA 108(7 ):3035-3040。
  14. McLean,CY,Bristor,D.,Hiller,M.,Clarke,SL,Schaar,BT,Lowe,CB,Wenger,AM和Bejerano,G。(2010)。< a class ="ke-insertfile"href ="http://www.ncbi.nlm.nih.gov/pubmed/20436461"target ="_ blank"> GREAT可改善顺式调控区域的功能解释。 Nat Biotechnol 28 (5):495-501。
  15. Park,PJ(2009)。  ChIP-seq:advantage和成熟技术的挑战。 Nat Rev Genet 10(10):669-680。
  16. Pepke,S.,Wold,B.和Mortazavi,A.(2009)。  ChIP-seq和RNA-seq研究的计算。 Nat方法 6(11 Suppl):S22-32。
  17. Quinlan,AR and Hall,IM(2010)。  BEDTools :用于比较基因组特征的灵活的实用工具。生物信息学 26(6):841-842。
  18. Renthal,W.,Kumar,A.,Xiao,G.,Wilkinson,M.,Covington,HE,3rd,Maze,I.,Sikder,D.,Robison,AJ,LaPlant,Q.,Dietz,DM,Russo ,SJ,Vialou,V.,Chakravarty,S.,Kodadek,TJ,Stack,A.,Kabbaj,M.and Nestler,EJ(2009)。< a class ="ke-insertfile"href ="http: //www.ncbi.nlm.nih.gov/pubmed/19447090"target ="_ blank">可卡因染色质调控的基因组范围分析揭示了sirtuins的作用。神经元 62 (3):335-348。
  19. Robison,AJ和Nestler,EJ(2011)。转录和成瘾的表观遗传机制。 Nat Rev Neurosci 12(11):623-637。
  20. Robinson,JT,Thorvaldsdottir,H.,Winckler,W.,Guttman,M.,Lander,ES,Getz,G.and Mesirov,JP(2011)。< a class ="ke-insertfile"href ="http ://www.ncbi.nlm.nih.gov/pubmed/21221095"target ="_ blank">综合基因组学观察者 Nat Biotechnol 29(1):24-26。< br />
  21. Shao,NY,Loh,YH and Shen,L.(2016)。  区域分析:用于基因组间隔注释的python程序。 github仓库。 Zenodo。
  22. Shen,L.,Shao,NY,Liu,X.,Maze,I.,Feng,J.and Nestler,EJ(2013)。  DiffReps:通过生物重复检测来自ChIP-seq数据的差异染色质修饰位点。 PLoS One 8 6):e65598。
  23. Shen,L.,Shao,N.,Liu,X.和Nestler,E。(2014)。 ngs.plot:通过整合基因组数据库快速挖掘和观察下一代测序数据。 > BMC Genomics 15: >
  24. Szalkowski,A.M.and Schmid,C.D。(2011)。 ChIP-seq峰值呼叫算法的快速创新是超越基准测试。 简要生物信息 12(6):626-633。
  25. Wilbanks,EG和Facciotti,MT(2010)。  评估的ChIP-seq峰检测中的算法性能。 PLoS One 5(7):e11471。
  26. Zhang,Y.,Liu,T.,Meyer,CA,Eeckhoute,J.,Johnson,DS,Bernstein,BE,Nusbaum,C.,Myers,RM,Brown,M.,Li,W.and Liu,XS 2008)。基于模型的ChIP-Seq分析( MACS)。 Genome Biol 9(9):R137。
  • English
  • 中文翻译
免责声明 × 为了向广大用户提供经翻译的内容,www.bio-protocol.org 采用人工翻译与计算机翻译结合的技术翻译了本文章。基于计算机的翻译质量再高,也不及 100% 的人工翻译的质量。为此,我们始终建议用户参考原始英文版本。 Bio-protocol., LLC对翻译版本的准确性不承担任何责任。
Copyright: © 2017 The Authors; exclusive licensee Bio-protocol LLC.
引用:Loh, Y. E., Feng, J., Nestler, E. and Shen, L. (2017). Bioinformatic Analysis for Profiling Drug-induced Chromatin Modification Landscapes in Mouse Brain Using ChlP-seq Data. Bio-protocol 7(3): e2123. DOI: 10.21769/BioProtoc.2123.
提问与回复

(提问前,请先登录)bio-protocol作为媒介平台,会将您的问题转发给作者,并将作者的回复发送至您的邮箱(在bio-protocol注册时所用的邮箱)。为了作者与用户间沟通流畅(作者能准确理解您所遇到的问题并给与正确的建议),我们鼓励用户用图片的形式来说明遇到的问题。

当遇到任何问题时,强烈推荐您通过上传图片的形式提交相关数据。