The genome assembly processes used are shown in fig. S1. The initial olive baboon genome assembly, Pham_1.0, used only Sanger and Roche 454 data. This assembly is no longer available at NCBI but can still be accessed at the University of California, Santa Cruz (UCSC). To avoid confusion, we emphasize that although it is listed under Baboon (hamadryas) and named papHam1 on the UCSC genome browser and at the Ensembl Pre! site, this first version of the assembly was derived not from a hamadryas baboon but from the female olive baboon identified above. The analyses reported here used only the later, improved assemblies, Panu_2.0 and Panu_3.0.

Panu_2.0 (named Panu_2.0 in NCBI and papAnu2.0 in Ensembl and UCSC; GenBank accession GCA_000264685.1) was produced from the available Sanger, Roche 454, and Illumina reads, derived from the same female olive baboon used for Pham_1.0. Assembly analyses used the GAC (Genomic Analysis Cluster) compute facilities at the Baylor College of Medicine Human Genome Sequencing Center (HGSC). Sanger and Roche 454 reads were first assembled using CABOG version 6.1 with parameter settings of utgErrorRate = 0.02, ovlErrorRate = 0.07, cnsErrorRate = 0.07, cgwErrorRate = 0.12, and unitigger = bog. Two sets of 100–base pair (bp) Illumina read data (2 billion reads from a 240-bp insert paired-end library and 500 million reads from a 2.5-kb insert mate-pair library) were mapped to the CABOG assembly using BWA with default parameters. The scaffolds of this initial CABOG-generated assembly were improved on the basis of the read mapping locations using Atlas-Link version 1.0, with the minimum required links (min_link) set at four for the 240-bp library and three for the 2.5-kb library. The Atlas-GapFill version 1.0 process was then performed to fill gaps between contigs within scaffolds by extracting local read pairs and aligning the local assemblies of these pairs to the gaps.
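The role of the min_link thresholds can be illustrated with a minimal Python sketch. It is not the Atlas-Link implementation: the function name `supported_joins`, the library labels, and the input format (one contig pair per read pair whose mates map to different contigs) are assumptions made for illustration, and a real scaffolder would also check mate orientation and insert-size consistency.

```python
from collections import Counter

# Thresholds taken from the text; the dictionary keys are hypothetical labels.
MIN_LINK = {"240bp_paired_end": 4, "2.5kb_mate_pair": 3}


def supported_joins(links, library):
    """Return contig pairs joined by at least min_link read pairs.

    links   -- iterable of (contig_a, contig_b) tuples, one per read pair
               whose two mates map to different contigs (assumed input format)
    library -- key into MIN_LINK selecting the threshold to apply
    """
    threshold = MIN_LINK[library]
    # Sort each pair so that (a, b) and (b, a) count as the same join.
    counts = Counter(tuple(sorted(pair)) for pair in links)
    return {pair: n for pair, n in counts.items() if n >= threshold}


example_links = [("c1", "c2")] * 3 + [("c2", "c1")] + [("c3", "c4")] * 3
short_insert = supported_joins(example_links, "240bp_paired_end")
long_insert = supported_joins(example_links, "2.5kb_mate_pair")
```

In this toy input the c1/c2 join has four supporting pairs and passes both thresholds, whereas the c3/c4 join has three and passes only the threshold used for the 2.5-kb library.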

The assembled contigs and scaffolds of Panu_2.0 were placed on baboon chromosomes by mapping to the rhesus macaque (Macaca mulatta) genome assembly (GCF_000002255.3, mmul_051212, rhemac2) using Mummer3 (parameters = nucmer -l 12 -c 65 -g 1000 -b 1000; delta-filter -1 -l 500; show-coords -cl -L 500). It should be noted that chromosome organization is largely conserved between rhesus macaque and baboon (45). A baboon scaffold was split when it lacked a continuous alignment on the macaque genome and the potential breakpoint was validated by low clone coverage in the baboon data (low coverage defined as clone coverage from the 2.5-kb Illumina library of <5×). A set of 323 scaffolds (a total of 217 Mb) was identified this way and therefore split. The N50 of the contigs in the Panu_2.0 assembly is 40.3 kb, and the N50 of the scaffolds is 529 kb. The total length of the Panu_2.0 assembly is 2.95 Gb with 55.1 Mb of gaps. Because the scaffolds for Panu_2.0 (and Panu_3.0) have been mapped onto baboon chromosomes, this genome assembly is presented in public databases (NCBI, UCSC, and Ensembl genome browsers) as chromosome-associated sequences rather than as sets of independent scaffolds and superscaffolds.
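Two quantities in this paragraph lend themselves to a short Python sketch: the N50 statistic and the two-part rule for splitting a scaffold. The code uses the standard definition of N50 and a direct reading of the rule stated above; the function names `n50` and `should_split` and their argument formats are hypothetical and do not reflect the scripts actually used for the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def should_split(has_continuous_alignment, clone_coverage, min_coverage=5.0):
    """Apply the splitting rule: no continuous alignment to macaque AND
    clone coverage from the 2.5-kb library below min_coverage (<5x)."""
    return (not has_continuous_alignment) and clone_coverage < min_coverage


toy_n50 = n50([2, 3, 4, 5, 6])  # total 20; 6 + 5 = 11 reaches half, so N50 is 5
```

Both conditions must hold for a split, so a scaffold with a discontinuous alignment but adequate clone coverage at the candidate breakpoint is left intact.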

Finally, we improved the Panu_2.0 assembly through two additional methods. First, a small number of differences between the baboon and rhesus macaque genomes were identified using fluorescence in situ hybridization (FISH) mapping of probes containing human BAC sequences. The affected scaffolds were refined to be consistent with the FISH results from the baboon genome. Second, a total of 12× whole-genome coverage was produced on the PacBio RSII platform, with half of the reads >7 kb. These data were mapped to the Panu_2.0 assembly, and two-thirds (67%) of the 118,928 gaps within scaffolds were closed using PBJelly software (46). The base quality of the assembly was then polished using the Pilon program (47) and the available Illumina data.

This final assembly (Panu_3.0) has a contig N50 of 149.8 kb and, because its scaffolds were mapped to chromosomes, it has superscaffolds of near whole-chromosome length. The gap filling with PBJelly added only 10.98 Mb to the assembly (0.37% of the Panu_2.0 assembly length) but closed a large number of gaps, reducing the number of contigs from 198,931 to 118,251. The Panu_3.0 assembly was tested against available baboon EST (Expressed Sequence Tag) sequence datasets to quantify the extent of coverage (i.e., completeness). Of the 144,708 Sanger EST sequences available at the time of testing, 99.98% were successfully mapped to the assembly. Among the total ESTs, 98.77% mapped over >90% of their length and 97.48% mapped over >95% of their length. Seven finished BAC clones were mapped to the Panu_2.0 assembly. The genomic coverage in the BACs was high, with 98 to 100% of the BAC sequence in the assembly. The assembled contigs and scaffolds were aligned linearly to the finished BACs, suggesting that misassemblies are rare. Within Panu_3.0, only 3.2% of the sequence falls in unscaffolded contigs.
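The gap-closing figures reported above can be cross-checked with simple arithmetic, sketched here in Python. The sketch assumes that each closed gap merges two contigs into one, so that the drop in contig count approximates the number of gaps closed; that assumption and the variable names are illustrative and are not taken from the original analysis.

```python
# Values quoted in the text.
panu2_total_bp = 2.95e9       # total length of Panu_2.0
added_bp = 10.98e6            # sequence added by PBJelly gap filling
contigs_before = 198_931      # contigs before gap filling
contigs_after = 118_251       # contigs after gap filling
gaps_in_scaffolds = 118_928   # gaps within scaffolds in Panu_2.0

# Fraction of the Panu_2.0 length added by gap filling.
added_fraction = added_bp / panu2_total_bp

# Assumption: one closed gap removes exactly one contig from the count.
contigs_merged = contigs_before - contigs_after
closed_fraction = contigs_merged / gaps_in_scaffolds

print(f"added: {added_fraction:.2%} of Panu_2.0 length")
print(f"contigs merged: {contigs_merged}")
print(f"approximate fraction of gaps closed: {closed_fraction:.1%}")
```

Under this assumption the added sequence is about 0.37% of the Panu_2.0 length, and the fraction of gaps closed comes out close to the roughly two-thirds reported.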
