Published: Sep 20, 2022 DOI: 10.21769/BioProtoc.4503
Edited by: Sanzhen Liu Reviewed by: Alba Blesa
Abstract
High-throughput chromosome conformation capture (Hi-C) technology has become an economical and robust tool for generating chromosome-scale assemblies. However, the quality of chromosome scaffolding is limited by the number of short and chimeric contigs, which leaves the assembly unsatisfactory in most cases. Here, we present a Hi-C scaffolding protocol based on ALLHiC, which integrates multiple functions to break chimeric contigs and generate chromosome-scale scaffolds. In addition, we describe a convenient way to curate the remaining misassemblies. This pipeline has been successfully applied to many genome projects, including our previously published banyan tree and oolong tea genomes.
Keywords: Genome assembly
Background
Construction of a chromosome-scale assembly is a step-by-step process: a contig-level assembly is generated first, and the contigs are then linked into scaffolds or chromosomes using long-range linking information, such as optical maps, 10× Genomics Linked-Reads, or Hi-C (Zhang et al., 2019; Zhang et al., 2020a). Hi-C (Lieberman-Aiden et al., 2009) is derived from 3C (Chromosome Conformation Capture) technology and integrated with next-generation sequencing; it serves as an economical and robust method that is widely used in genome projects (Dudchenko et al., 2017; Zhang et al., 2020b). Hi-C can capture many chromatin proximities in parallel and can span long genomic distances, even between regions separated by >200 Mb. Hence, based on proximity information from Hi-C, contigs can be linked into chromosome-scale assemblies.
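To make the idea of linking contigs by proximity information concrete, the toy sketch below counts how many Hi-C read pairs connect each pair of contigs. It is an illustration only and not the ALLHiC algorithm; the function name `count_contig_links` and the input format (one tuple of contig names per read pair) are assumptions made for this example.

```python
from collections import Counter


def count_contig_links(read_pairs):
    """Count Hi-C read pairs that connect two different contigs.

    `read_pairs` is an iterable of (contig_of_read1, contig_of_read2)
    tuples; this input format is hypothetical and chosen for illustration.
    """
    links = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a == contig_b:
            # Pairs within one contig carry no scaffolding signal here.
            continue
        # Sort the names so (a, b) and (b, a) count as the same link.
        links[tuple(sorted((contig_a, contig_b)))] += 1
    return links


example_pairs = [("ctg1", "ctg2"), ("ctg2", "ctg1"), ("ctg1", "ctg1"), ("ctg2", "ctg3")]
example_links = count_contig_links(example_pairs)
```

In this toy input, contig pairs supported by more read pairs would be treated as more likely to be neighbors; real scaffolders additionally normalize for contig length and restriction-site density.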
In the past decade, several Hi-C scaffolding algorithms have been developed, including LACHESIS (Burton et al., 2013), SALSA (Ghurye et al., 2017), and 3D-DNA (Dudchenko et al., 2017). Our team also developed a Hi-C scaffolding algorithm, ALLHiC. Although initially designed for chromosome phasing in polyploid genomes (Zhang et al., 2019), ALLHiC is also capable of chromosome-scale assembly of diploid genomes. Here, we describe a pipeline for chromosome scaffolding of diploid genomes with ALLHiC, from an initial contig-level assembly to a high-quality chromosome-scale assembly (Figure 1). This pipeline uses Hi-C paired-end reads to generate a mosaic genome assembly of diploids and provides a method to correct chimeric scaffolds. Moreover, this pipeline has been successfully applied to diploid chromosome scaffolding of the banyan tree (Zhang et al., 2020b) and oolong tea (Zhang et al., 2021) genomes.

Figure 1. Workflow of the pipeline for chromosome-scale scaffolding using ALLHiC.
Software and Data sets
Software
Linux OS
BWA (version 0.7.17) (https://github.com/lh3/bwa)
samtools (version 1.9) (http://www.htslib.org/)
bedtools (https://github.com/arq5x/bedtools2)
asmkit (version 0.0.1) (https://github.com/wangyibin/asmkit)
ParaFly (version 2013-01-21) (http://parafly.sourceforge.net)
3D-DNA (https://github.com/aidenlab/3d-dna)
juicebox_scripts (https://github.com/phasegenomics/juicebox_scripts)
Perl (version 5) (https://perl.org)
Python (version 3) (https://python.org)
Matplotlib (version 3.3.4) (https://matplotlib.org)
Pandas (version 1.3.0) (https://pandas.pydata.org)
Pysam (version 0.18.0) (https://pysam.readthedocs.io/en/latest)
Numpy (version 1.16.5) (https://www.numpy.org)
Windows/MacOS
Juicebox (https://github.com/aidenlab/Juicebox)
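Before starting, it can help to confirm that the command-line tools listed above are reachable on the Linux machine. The following is a minimal sketch; the executable names in `REQUIRED_TOOLS` are assumptions (they may differ on your system), and tools whose executable names depend on the installation, such as asmkit, ParaFly, and the 3D-DNA scripts, are left for you to add.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["bwa", "samtools", "bedtools", "perl", "python3"]


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This check only verifies that an executable exists; it does not verify the versions given in the list.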
Data sets
The input data for our pipeline are Hi-C reads in FASTQ format and a draft assembly in FASTA format.
A small test data set can be downloaded from Google Drive (https://drive.google.com/file/d/1oE6HpOTZ6rFSlVLOjO0EpIH_-cULCWec/view?usp=sharing).
Note: This small data set was generated from Arabidopsis thaliana; the contig assembly was downloaded from 1001 Genomes (https://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/Ler-1/Assemblies/Allpaths_LG/Ler-1.allpaths_lg.final.assembly.fasta), and the Hi-C data were downloaded from http://ibi.hzau.edu.cn/3dmodel/download/mp2014_raw_data.tar.gz.
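Because scaffolding quality depends on the contiguity of the draft assembly, it can be useful to summarize the input FASTA file before running the pipeline. The sketch below computes contig lengths and N50 from FASTA-formatted lines; the helper names `read_fasta_lengths` and `n50` are hypothetical and are not part of ALLHiC or asmkit.

```python
def read_fasta_lengths(lines):
    """Map each sequence name to its length from FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of lengths (0 if it is empty)."""
    sorted_lengths = sorted(lengths, reverse=True)
    half_total = sum(sorted_lengths) / 2
    running = 0
    for length in sorted_lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Tiny in-memory example; for a real file, pass an open file handle instead.
example = read_fasta_lengths([">ctgA test", "ACGT", "AC", ">ctgB", "ACG"])
example_n50 = n50(example.values())
```

For a real assembly, the same functions can be applied with `with open("draft.asm.fasta") as handle: lengths = read_fasta_lengths(handle)`, where the file name is a placeholder.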
Procedure
Category
Bioinformatics and Computational Biology
Plant Science > Plant molecular biology > DNA > DNA sequencing
Systems Biology > Genomics > Sequencing