Paired-end reads are mapped using bowtie2 with the following parameters:
--no-discordant
--no-mixed
--very-sensitive
--no-unal
--omit-sec-seq
--xeq
--reorder
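As a rough illustration, the alignment call could be assembled as in the sketch below. Only the seven flags come from the protocol; the index prefix, FASTQ names, and output name are hypothetical placeholders.

```python
import shlex

# Flags listed in the protocol.
BOWTIE2_FLAGS = [
    "--no-discordant",
    "--no-mixed",
    "--very-sensitive",
    "--no-unal",
    "--omit-sec-seq",
    "--xeq",
    "--reorder",
]


def bowtie2_command(index_prefix, fastq_r1, fastq_r2, out_sam):
    """Build a paired-end bowtie2 command line (all file names are hypothetical)."""
    return [
        "bowtie2", *BOWTIE2_FLAGS,
        "-x", index_prefix,   # bowtie2 index prefix
        "-1", fastq_r1,       # first mates
        "-2", fastq_r2,       # second mates
        "-S", out_sam,        # SAM output
    ]


cmd = bowtie2_command("genome_index", "sample_R1.fastq.gz",
                      "sample_R2.fastq.gz", "sample.sam")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run without shell quoting.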
Reads are filtered using samtools view with the parameters "-b -F 1804 -f 2 -q 2". This removes reads with a mapping quality below 2 (multi-mapping, as marked by bowtie2), reads that lack the "properly aligned" flag (bit 2), and reads that carry any of the following flags: unmapped segment (bit 4), unmapped next segment (bit 8), secondary alignment (bit 256), failed filter (bit 512), and duplicate (bit 1024).
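The flag arithmetic behind this filter can be checked with a small sketch. It mirrors the logic of the "-F", "-f", and "-q" options on plain integers and is not a replacement for samtools; the function name is chosen here for illustration.

```python
# SAM flag values named in the text.
PROPER_PAIR = 2
UNMAPPED = 4
MATE_UNMAPPED = 8
SECONDARY = 256
QC_FAIL = 512
DUPLICATE = 1024

EXCLUDE_MASK = UNMAPPED | MATE_UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE  # -F 1804
REQUIRE_MASK = PROPER_PAIR                                                 # -f 2
MIN_MAPQ = 2                                                               # -q 2


def passes_filter(flag, mapq):
    """Return True if an alignment with this flag and MAPQ would be kept."""
    has_excluded_bit = (flag & EXCLUDE_MASK) != 0
    has_required_bits = (flag & REQUIRE_MASK) == REQUIRE_MASK
    return (not has_excluded_bit) and has_required_bits and mapq >= MIN_MAPQ


print(EXCLUDE_MASK)            # sum of the five excluded flag values
print(passes_filter(99, 42))   # paired, properly aligned, first in pair
print(passes_filter(99, 1))    # same flags, MAPQ below the threshold
```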
Duplicate reads are removed using Picard MarkDuplicates with the parameters "VALIDATION_STRINGENCY=LENIENT ASSUME_SORT_ORDER=queryname REMOVE_DUPLICATES=true". This removes reads originating from the same molecule, as determined by comparison of their 5' positions and sequences. From each pool of duplicate reads, the read with the highest base quality sum is retained.
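The selection rule described above (keep the highest base quality sum within each duplicate group) can be sketched on toy records. This is a simplified illustration, not Picard's implementation: the record layout and the "key" field, which stands in for whatever 5' position information defines a duplicate group, are assumptions made for the example.

```python
def remove_duplicates(pairs):
    """Keep one record per duplicate group: the one with the highest quality sum.

    Each record is a dict with hypothetical fields:
      'name'  - read name
      'key'   - hashable value identifying the duplicate group
      'quals' - list of integer base qualities
    """
    best = {}
    for pair in pairs:
        score = sum(pair["quals"])
        key = pair["key"]
        # Replace the stored record only if this one scores strictly higher.
        if key not in best or score > best[key][0]:
            best[key] = (score, pair)
    return [pair for _, pair in best.values()]


toy = [
    {"name": "a", "key": ("chr1", 100, 350), "quals": [30, 30]},
    {"name": "b", "key": ("chr1", 100, 350), "quals": [40, 40]},
    {"name": "c", "key": ("chr1", 500, 720), "quals": [10]},
]
print([p["name"] for p in remove_duplicates(toy)])
```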
Reads from replicate samples are pooled using samtools merge.
Pooled reads are sorted by name. This places the two ends of each read pair next to each other in the file.
K-mer databases are created for the genome using KMC2 with the parameters "-k<KMER_SIZE> -fm -ci2", with KMER_SIZE taking the values 50, 75, 80, 85, 90, 95, and 100, for a total of 7 k-mer databases. The remaining parameters specify that the genome FASTA file contains multiple sequences (-fm) and that k-mers appearing only once in the genome (unique k-mers) are excluded (-ci2).
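One way to generate the seven KMC invocations is sketched below. The executable name "kmc" and the positional order (input file, output prefix, working directory) are assumptions based on the usual KMC usage message and should be checked against the installed version; all file and directory names are hypothetical.

```python
KMER_SIZES = [50, 75, 80, 85, 90, 95, 100]


def kmc_command(kmer_size, genome_fasta, out_prefix, tmp_dir):
    """Build one KMC command line for a given k-mer size (paths are hypothetical)."""
    return [
        "kmc",
        f"-k{kmer_size}",  # k-mer length
        "-fm",             # input is a multi-FASTA file
        "-ci2",            # exclude k-mers occurring fewer than 2 times
        genome_fasta,
        f"{out_prefix}_k{kmer_size}",
        tmp_dir,
    ]


commands = [kmc_command(k, "genome.fa", "genome", "kmc_tmp") for k in KMER_SIZES]
for command in commands:
    print(" ".join(command))
```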
For each k-mer database, a count map wiggle file is produced using kmc_genome_counts and the genome FASTA file. This wiggle file has a value for each base position specifying the number of times the corresponding k-mer from the genomic sequence appears anywhere in the genome. The k-mer sequence starts at the base position and extends downstream a number of bases equal to the k-mer size.
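The idea of a per-position count map can be illustrated on a toy sequence. This sketch is a simplified model, assuming a single sequence and forward-strand k-mers only; it does not reproduce the behavior of KMC or kmc_genome_counts (for example, strand handling or the effect of the -ci2 threshold).

```python
from collections import Counter


def kmer_count_map(sequence, k):
    """For each position that starts a full k-mer, return how many times
    that k-mer occurs anywhere in the sequence."""
    n_positions = len(sequence) - k + 1
    if n_positions <= 0:
        return []
    counts = Counter(sequence[i:i + k] for i in range(n_positions))
    return [counts[sequence[i:i + k]] for i in range(n_positions)]


# "ACGT" starts at positions 0 and 4; every other 4-mer occurs once.
print(kmer_count_map("ACGTACGTTT", 4))  # → [2, 1, 1, 1, 2, 1, 1]
```

Positions with a value of 1 in such a map mark k-mers that occur only once in the toy sequence.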
The k-mer count map wiggle files are converted to bigwig files using the UCSC genome browser utility wigToBigWig.
Pooled and sorted reads are filtered by unique k-mers using the k-mer bigwig files and the custom script filter_by_unique_kmers.py, found at https://github.com/msauria/T2T_Encode_Analysis/blob/main/bin/filter_by_unique_kmers.py. This script loads each of the k-mer count map bigwigs (specified by a filename template and a comma-separated list of k-mer sizes). For each read, the reference stop position minus the reference start position (the reference span) determines the k-mer length to use: the largest k-mer size that is less than or equal to the reference span is selected. Once the k-mer size has been determined, the score for each k-mer of that size occurring in the reference sequence covered by the mapped read is checked in the corresponding k-mer count map (e.g. a read mapping to 99 bases of reference sequence would use the k-mer size 95, and would contain 5 separate 95-mers that would be looked up in the count map). If any of the values in the count map corresponding to the contained k-mers is equal to one (signifying a unique k-mer sequence), the read and its mate are retained. Otherwise, the read and its mate are discarded.
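The per-read decision described above can be sketched as follows. This is a minimal illustration rather than the code of filter_by_unique_kmers.py: count maps are represented as plain Python lists for one chromosome instead of bigwig files, the function names are hypothetical, and treating a read shorter than the smallest k-mer size as not unique is an assumption, since the text does not cover that case. How the results for the two mates are combined is left to the script.

```python
def select_kmer_size(reference_span, kmer_sizes):
    """Largest k-mer size that is less than or equal to the reference span."""
    eligible = [k for k in kmer_sizes if k <= reference_span]
    return max(eligible) if eligible else None


def read_has_unique_kmer(ref_start, ref_end, count_maps):
    """Check whether any k-mer covered by the read has a count of exactly one.

    count_maps maps each k-mer size to a list of per-position counts,
    where index i holds the count of the k-mer starting at position i.
    """
    span = ref_end - ref_start
    k = select_kmer_size(span, count_maps.keys())
    if k is None:
        return False  # assumption: reads shorter than every k are not unique
    n_kmers = span - k + 1  # number of k-mers fully contained in the span
    window = count_maps[k][ref_start:ref_start + n_kmers]
    return any(value == 1 for value in window)


sizes = [50, 75, 80, 85, 90, 95, 100]
print(select_kmer_size(99, sizes))  # → 95
print(99 - 95 + 1)                  # → 5 (95-mers inside a 99-base span)
```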
The k-mer-filtered reads are sorted by position using samtools sort.
Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:
Gershman, A., Sauria, M., Miga, K. and Timp, W. (2022). ENCODE dynamic k-mer–assisted mapping. Bio-protocol Preprint. bio-protocol.org/prep2033.
Gershman, A., Sauria, M. E. G., Guitart, X., Vollger, M. R., Hook, P. W., Hoyt, S. J., Jain, M., Shumate, A., Razaghi, R., Koren, S., Altemose, N., Caldas, G. V., Logsdon, G. A., Rhie, A., Eichler, E. E., Schatz, M. C., O’Neill, R. J., Phillippy, A. M., Miga, K. H. and Timp, W. (2022). Epigenetic patterns in a complete human genome. Science 376(6588). DOI: 10.1126/science.abj5089
This protocol preprint was submitted via the "Request a Protocol" track.