To map reads to repetitive elements, we created a pseudo-genome that contains only the repeat sequences. The in-house scripts we used are available in our code repository. In summary, the pseudo-genome can be created as follows (on a Linux operating system):
Clone our code repository
git clone https://github.com/sirusb/2CLike_analysis.git
Go to the Pseudogenome folder
cd 2CLike_analysis/Pseudogenome
Download the mm9 repeat annotation from the RepEnrich Google Drive and put it in the Pseudogenome folder.
Decompress the downloaded 'mm9_repeatmasker_clean.txt.gz' file as follows:
gunzip mm9_repeatmasker_clean.txt.gz
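Before building the genome, it can help to check that the decompressed annotation looks as expected. The following is a minimal sketch, not part of the repository: chrom_counts is a hypothetical helper, and it assumes (as the download script in the next step does) that the chromosome name sits in column 5 of the whitespace-separated annotation.

```shell
# Hypothetical helper (not part of the repository): count annotated repeats
# per chromosome. Assumes the chromosome name is in column 5.
chrom_counts() {
    annotation="$1"
    # Keep only rows whose 5th column looks like a chromosome name,
    # tally them, and print "name<TAB>count" sorted by name.
    awk '$5 ~ /chr/ {n[$5]++} END {for (c in n) print c "\t" n[c]}' "$annotation" | sort
}
```

For example, chrom_counts mm9_repeatmasker_clean.txt should print one line per chromosome; if the first column of the output does not look like chromosome names, the column assumption does not hold for your file.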
Create an mm9 FASTA file that contains all the chromosomes present in 'mm9_repeatmasker_clean.txt' using the following bash script:
## List each chromosome named in column 5 once, in order of first appearance
chroms=$(awk '$5 ~ /chr/ && !seen[$5]++ {print $5}' mm9_repeatmasker_clean.txt)
genome_version='mm9'
## Download the different chromosome .fa files
for f in $chroms
do
echo "Downloading ${f}.fa.gz"
wget http://hgdownload.cse.ucsc.edu/goldenPath/${genome_version}/chromosomes/${f}.fa.gz -O ${f}.fa.gz
## Append the decompressed chromosome to the combined genome file
zcat ${f}.fa.gz >> ${genome_version}.fa
done
## Remove intermediate files
echo "removing intermediate files"
rm chr*.fa.gz
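A failed download leaves a chromosome out of the combined FASTA without stopping the loop, so a quick consistency check may be worthwhile. The sketch below is an assumption-laden illustration rather than part of the repository: missing_chroms is a hypothetical helper, and it assumes that chromosome names are in column 5 of the annotation and that FASTA headers have the form '>chr1'.

```shell
# Hypothetical helper (not part of the repository): print chromosomes that
# appear in the annotation but have no header in the FASTA file.
missing_chroms() {
    annotation="$1"
    fasta="$2"
    tmpdir=$(mktemp -d)
    # Unique chromosome names from column 5 of the annotation
    awk '$5 ~ /chr/ {print $5}' "$annotation" | sort -u > "$tmpdir/annotated"
    # Unique sequence names from the FASTA headers (drop ">" and any description)
    grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*$//' | sort -u > "$tmpdir/in_fasta"
    # Lines only in the first list are chromosomes missing from the FASTA
    comm -23 "$tmpdir/annotated" "$tmpdir/in_fasta"
    rm -rf "$tmpdir"
}
```

Running missing_chroms mm9_repeatmasker_clean.txt mm9.fa should print nothing when every annotated chromosome was downloaded; any name it prints needs to be downloaded and appended again.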
Open the file run_buildPseudogenome.sh and edit the path to Picard tools.
Run the run_buildPseudogenome.sh script
sh run_buildPseudogenome.sh
Once the script finishes running, it will have created the rms_Pseudo_out folder, which contains the newly created genome and its STAR index.
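To confirm that the index was built, one option is to look for the core files that STAR writes when generating a genome index (Genome, SA and SAindex). This is only a sketch: has_star_index is a hypothetical helper, and because the exact layout inside rms_Pseudo_out is not described here, it searches the folder recursively instead of assuming a fixed path.

```shell
# Hypothetical helper (not part of the repository): succeed if the given
# folder, searched recursively, contains the core STAR index files.
has_star_index() {
    dir="$1"
    for name in Genome SA SAindex; do
        # An empty result from find means no file with this name exists
        if [ -z "$(find "$dir" -type f -name "$name" 2>/dev/null | head -n 1)" ]; then
            echo "missing STAR index file: $name" >&2
            return 1
        fi
    done
    return 0
}
```

For example, has_star_index rms_Pseudo_out && echo "index found" prints the message only if all three files are present somewhere under the output folder.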