Published: Aug 5, 2022 DOI: 10.21769/BioProtoc.4475 Views: 974
Edited by: Sanzhen Liu Reviewed by: Liangliang Gao
Abstract
Hi-C is a chromosome conformation capture method originally developed to detect genome-wide chromatin interactions. It is now widely used to scaffold de novo assembled contigs into chromosome-scale genome sequences. Several open-source tools have been developed to perform genome scaffolding with Hi-C data. The input is a set of contigs assembled de novo from long-read or short-read sequencing data. Hi-C reads are mapped to these contigs, and the software computes an interaction matrix that is used to scaffold the contigs into chromosome-scale sequences. Different tools use different algorithms to calculate the interaction matrix and to correct misassemblies and misjoins, and they may require different dependencies or running environments. Here, we describe a step-by-step protocol for genome scaffolding using Hi-C data with a comprehensive pipeline: compute the interaction matrix with Juicer, scaffold contigs with the 3D-DNA pipeline, and then visualize and modify the scaffolds with Juicebox. This is the first detailed protocol showing how to perform Hi-C scaffolding with this pipeline in plants. Compared to many other pipelines, this protocol requires only the primary assembled contigs and raw Hi-C data as inputs. Moreover, it is compatible with multiple enzymes and provides visualization and the possibility of manual correction. As more and more genomes are sequenced with the aid of Hi-C, this step-by-step protocol may be widely applicable to scaffolding large eukaryotic genomes.
Keywords: Hi-C
Background
A plant genome sequence provides valuable information for all kinds of molecular biological studies. In recent years, advances in sequencing technology have made genome sequencing faster and more affordable. Nevertheless, chromosome-scale genome sequences are still hard to obtain with next-generation sequencing (NGS) or long-read sequencing alone, because of complicated genomic structures such as long interspersed repeats or highly homologous genome blocks. To overcome this, genetic linkage maps or optical maps have been used to order and orient contigs into chromosome-scale sequences (Yamaguchi et al., 2021). However, constructing a genetic linkage map is labour-intensive and time-consuming in plants. Meanwhile, an optical map requires a large quantity of high-quality, high molecular weight DNA, which makes its production relatively difficult. In contrast, the rapidly developing Hi-C scaffolding approach requires only 100 mg of plant tissue and short-read sequencing on an NGS platform, which makes it economical in both tissue and cost. However, the quality of the Hi-C library must be carefully controlled, because it determines the success of the genome scaffolding. Young leaf tissue is commonly used for Hi-C library preparation in plants. Multiple rounds of quality control are recommended during library preparation (Kadota et al., 2020). In particular, small-scale sequencing is highly recommended to evaluate the quality of the library, including the proportion of valid interaction reads, and to estimate the number of read pairs required for subsequent deep sequencing. Hi-C scaffolding has become one of the main solutions for obtaining chromosome-scale scaffolds and has been widely used in recent plant genome sequencing projects. Meanwhile, several open-source tools have been developed to compute the interaction matrix and to order and orient assembled contigs into scaffolds (Table 1).
Among these tools, the 3D-DNA pipeline is widely used and supports interactive visualization and manual modification of the scaffolds.
Table 1. Overview of the major Hi-C scaffolding software
Program | Input format | Other information | Literature |
---|---|---|---|
3D-DNA | Juicer mapper format | Compatible with multiple enzymes; results can be visualized and modified by Juicebox | (Dudchenko et al., 2017) |
LACHESIS | Generic bam format | No function to correct misjoins; developer’s support discontinued | (Burton et al., 2013) |
HiRise | Generic bam format | Used in Dovetail Chicago/Hi-C service; no open-source update available since 2015 | (Putnam et al., 2016) |
SALSA2 | Generic bam (bed) file, assembly graph, unitig, 10× link files | Compatible with multiple enzymes; results can be visualized by Juicebox | (Ghurye et al., 2019) |
ALLHiC | Hi-C reads; gene annotation or closely related chromosome-scale reference genome | Designed for scaffolding plant polyploid genome | (Zhang et al., 2019) |
HiCAssembler | Hi-C matrix in h5 format created by HiCExplorer | Assembly errors can be manually corrected by specifying the position in the software | (Renschler et al., 2019) |
instaGRAAL | Hi-C matrix created by hicstuff or HiC-Box | Requires NVIDIA CUDA and GPU environment | (Baudry et al., 2020) |
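To make the idea of an interaction matrix concrete, the following is a minimal Python sketch that counts Hi-C read pairs linking each pair of contigs and then picks, for a given contig, the neighbour it shares the most links with. It is a toy illustration only, not the algorithm used by Juicer, 3D-DNA, or any other tool in Table 1, which also account for factors such as contig length, mapping position, and restriction site density. The contig names, read-pair list, and function names are hypothetical.

```python
def contact_matrix(contigs, read_pairs):
    """Count Hi-C read pairs linking each pair of contigs.

    contigs: list of contig names (defines the row/column order).
    read_pairs: iterable of (contig_of_read1, contig_of_read2) tuples.
    Returns a symmetric list-of-lists matrix of link counts.
    """
    index = {name: i for i, name in enumerate(contigs)}
    n = len(contigs)
    matrix = [[0] * n for _ in range(n)]
    for a, b in read_pairs:
        i, j = index[a], index[b]
        matrix[i][j] += 1
        if i != j:
            # Keep the matrix symmetric for inter-contig links.
            matrix[j][i] += 1
    return matrix


def best_partner(contigs, matrix, name):
    """Return the other contig that shares the most links with `name`."""
    i = contigs.index(name)
    candidates = [
        (matrix[i][j], contigs[j]) for j in range(len(contigs)) if j != i
    ]
    return max(candidates)[1]


# Hypothetical toy data: each tuple is one mapped Hi-C read pair.
CONTIGS = ["ctgA", "ctgB", "ctgC"]
PAIRS = (
    [("ctgA", "ctgB")] * 5
    + [("ctgB", "ctgC")] * 2
    + [("ctgA", "ctgA")] * 7
    + [("ctgA", "ctgC")]
)
MATRIX = contact_matrix(CONTIGS, PAIRS)

for contig_name, row in zip(CONTIGS, MATRIX):
    print(contig_name, row)
print("Most linked to ctgA:", best_partner(CONTIGS, MATRIX, "ctgA"))
```

In this toy data set, ctgA and ctgB share more links than any other pair, which is the kind of signal a scaffolder uses to place contigs next to each other.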
Software
Trimmomatic (Bolger et al., 2014) (http://www.usadellab.org/cms/?page=trimmomatic)
Juicer (Durand et al., 2016) (https://github.com/aidenlab/juicer/)
3D-DNA pipeline (Dudchenko et al., 2017) (https://github.com/aidenlab/3d-dna)
Juicebox (version 1.11.08) (https://github.com/aidenlab/Juicebox)
BWA (Li and Durbin, 2009) (http://bio-bwa.sourceforge.net/)
Samtools (Li et al., 2009) (http://www.htslib.org/)
Miniconda (https://docs.conda.io/en/latest/miniconda.html)
BUSCO (Seppey et al., 2019) (https://gitlab.com/ezlab/busco)
Java 1.8 JDK (https://www.oracle.com/java/technologies/downloads/#java8)
Note: We recommend using the latest version of each software package listed above, except for Juicebox (v. 1.11.08).
Equipment
Linux server or cluster
PC or Mac with at least 16 GB RAM for handling large genomes (>1 Gb)
Input data
De novo assembly contigs file in FASTA format
Raw Hi-C sequencing data in FASTQ format
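Before running the pipeline, it can help to sanity-check the contig FASTA file. The sketch below is a hedged illustration, not part of the published protocol: it reads contig lengths from FASTA text, computes the N50, and formats a two-column name/length table. Tools such as Juicer commonly take a table of this layout as a chromosome-sizes file, but that use is an assumption here and should be confirmed against the tool's documentation. The function names and the toy FASTA record are hypothetical.

```python
import io


def read_fasta_lengths(handle):
    """Return {contig_name: length} from an open FASTA text handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the contig name.
            name = line[1:].split()[0]
            if name in lengths:
                raise ValueError(f"duplicate contig name: {name}")
            lengths[name] = 0
        elif name is None:
            raise ValueError("sequence data before first FASTA header")
        else:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    values = sorted(lengths, reverse=True)
    total = sum(values)
    running = 0
    for length in values:
        running += length
        if running * 2 >= total:
            return length
    return 0


def sizes_table(lengths):
    """Format contig lengths as tab-separated 'name<TAB>length' lines."""
    return "".join(f"{name}\t{length}\n" for name, length in lengths.items())


# Hypothetical toy assembly; a real file would be opened with open(path).
EXAMPLE_FASTA = ">ctg1 toy\nACGTACGTAC\nGG\n>ctg2\nACGT\n>ctg3\nACGTAC\n"
LENGTHS = read_fasta_lengths(io.StringIO(EXAMPLE_FASTA))

print(sizes_table(LENGTHS), end="")
print("N50:", n50(LENGTHS.values()))
```

For a real assembly, the in-memory example can be replaced with `open("contigs.fasta")`, where the file name is a placeholder.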
Procedure