General protocol for the PMGL pipeline

Kim Palacios-Flores
Jair García-Sotelo
Alejandra Castillo
Carina Uribe
Luis Aguilar
Lucía Morales
Laura Gómez-Romero
José Reyes
Alejandro Garciarubio
Margareta Boege
Guillermo Dávila

The reference genome sequence in FASTA format is used to generate a binary database of the reference genome using Bowtie (Langmead et al. 2009), and to generate the ordered set of reference strings (25-mers in this study) that constitute the entire reference genome. The number of exact occurrences of each reference string's sequence in the reference genome database is computed. A Reference Genome Self Landscape (RGSL) is generated by reporting, for each reference string, its unique identifier, its number of exact occurrences in the reference genome, its sequence, and the unique identifiers of all reference strings sharing the same sequence. The raw query genome sequence reads in FASTQ format are used to generate a binary database of read string counts (25-mers in this study) computed by Jellyfish (Marçais and Kingsford 2011). The use of quality-trimmed sequence reads is not necessary (see Supplemental Material, File S1, Figure S1). A PMGL is generated by reporting the perfect match coverage between the reference genome and the query genome at each reference string along the RGSL. The perfect match coverage is then normalized by the level of repetitiveness of each reference string in the reference genome. Finally, the normalized perfect match coverage at reference string n is divided by the normalized perfect match coverage at reference string n−1. This ratio is the signature value of reference string n. The PMGL is scanned to localize signatures of variation. A signature of variation is defined as a decrease in the normalized perfect match coverage that generates a trail of zero or near-zero values terminating at position n−1, followed by its immediate recovery at position n. The PMGL scan parameters and their relation to sequencing coverage have been evaluated experimentally (Figure S1). Zero-trail signatures of variation are associated with a high signature value at position n.
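
The RGSL, PMGL, and scan steps described above can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: plain Python dictionaries stand in for the Bowtie and Jellyfish databases, only the forward strand is counted, and every name and parameter (`build_rgsl`, `build_pmgl`, `scan_pmgl`, `zero_floor`, `near_zero`, `min_trail`, `min_recovery`) is hypothetical. In particular, the source does not say how a zero denominator is handled when computing the signature value, so the `zero_floor` used here is an assumption.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class RefString(NamedTuple):
    ident: int            # position of the string in the reference (unique identifier)
    occurrences: int      # exact occurrences of its sequence in the reference
    sequence: str
    same_sequence: tuple  # identifiers of all reference strings with this sequence


def kmers(seq, k):
    """Return every overlapping k-mer of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_rgsl(reference, k):
    """Build one RGSL record per reference string (forward strand only)."""
    ids_by_sequence = defaultdict(list)
    strings = kmers(reference, k)
    for ident, s in enumerate(strings):
        ids_by_sequence[s].append(ident)
    return [
        RefString(ident, len(ids_by_sequence[s]), s, tuple(ids_by_sequence[s]))
        for ident, s in enumerate(strings)
    ]


def build_pmgl(rgsl, reads, k, zero_floor=0.5):
    """Return (normalized coverage, signature values) along the RGSL.

    zero_floor is an assumed guard against division by zero; the
    source does not specify how zero coverage at n-1 is treated.
    """
    read_counts = Counter(s for read in reads for s in kmers(read, k))
    # Perfect match coverage divided by the repetitiveness of the string.
    normalized = [read_counts[r.sequence] / r.occurrences for r in rgsl]
    signature = [None] * len(normalized)  # undefined for the first string
    for n in range(1, len(normalized)):
        signature[n] = normalized[n] / max(normalized[n - 1], zero_floor)
    return normalized, signature


def scan_pmgl(normalized, near_zero=0.5, min_trail=1, min_recovery=2.0):
    """Report positions n where a near-zero trail ends at n-1 and coverage recovers at n.

    All three thresholds are illustrative placeholders, not the
    experimentally tuned scan parameters of the pipeline.
    """
    hits, trail = [], 0
    for n, cov in enumerate(normalized):
        if cov <= near_zero:
            trail += 1
            continue
        if trail >= min_trail and cov >= min_recovery:
            hits.append(n)
        trail = 0
    return hits
```

With a toy 20-base reference, 5-mers instead of 25-mers, and reads carrying a single substitution at index 9, the five reference strings overlapping the substitution get zero coverage and the scan reports position 10, the first string downstream of the variant.
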
For single nucleotide variants, microindels, and indels, the reference string at position n, or downstream recovery string, corresponds to a perfect match zone that is immediately adjacent to the variation. Its sequence is used to identify the subset of query genome sequence reads that perfectly contain it. The query genome sequence(s) defined by these reads are aligned with the corresponding region of the reference genome using the MUSCLE multiple sequence alignment tool (Edgar 2004). The nature of the variant(s) is revealed by an iterative process of alignment interpretation and extension that results in a single final alignment. Finally, discovered variants are introgressed into the original reference genome sequence to generate a customized reference genome. The PMGL pipeline is then rerun with the customized reference genome and the original query genome sequence reads as input; the disappearance of the signatures of variation validates the precise location and nature of the discovered variants.
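
Two of the steps above, selecting the reads that perfectly contain the recovery string and introgressing a discovered variant into the reference, can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: the function names are hypothetical, checking the reverse complement of each read is an assumption about how reads from the opposite strand would be handled, and the MUSCLE alignment and its iterative interpretation are left out entirely.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (other symbols such as N are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def reads_containing(recovery, reads):
    """Return reads that perfectly contain the recovery string.

    Reads matching on the opposite strand are returned reverse
    complemented, so every selected read is in reference orientation.
    (Assumption: the source does not describe strand handling.)
    """
    recovery_rc = reverse_complement(recovery)
    selected = []
    for read in reads:
        if recovery in read:
            selected.append(read)
        elif recovery_rc in read:
            selected.append(reverse_complement(read))
    return selected


def introgress(reference, start, ref_allele, alt_allele):
    """Replace ref_allele at 0-based position start with alt_allele.

    Covers substitutions, insertions, and deletions when the alleles
    are written with their shared flanking base(s).
    """
    end = start + len(ref_allele)
    if reference[start:end] != ref_allele:
        raise ValueError("reference does not carry ref_allele at this position")
    return reference[:start] + alt_allele + reference[end:]
```

The customized reference returned by `introgress` would then be fed back through the landscape generation and scan; if the variant call is correct, the signature at that location should no longer appear.
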

The PMGL pipeline has been fully automated and comprises six computational modules: (1) generation of the RGSL, (2) generation of the PMGL, (3) scanning of the PMGL, (4) generation of the first alignment at each signature of variation, (5) interpretation and extension of alignments, and (6) generation of a customized reference genome. All modules are described in detail in File S1.
