Alternative polyadenylation (APA) analysis: Python and R scripts used to perform the alternative polyadenylation analysis are available in “Supplementary file 2: Alternative polyadenylation analysis code,” found in the “Additional files” section of the manuscript. This document serves as the help file to execute the codes.
Dependency packages:
Python packages: pybedtools, pandas, concurrent.futures R Packages: DRIMSeq # BiocManager::install("DRIMSeq") fastp: https://github.com/OpenGene/fastp.git
pigz: https://github.com/madler/pigz.git
bowtie2: https://github.com/BenLangmead/bowtie2.git samtools: https://github.com/samtools/samtools.git
Usage: PACSeqAnalysisPolyA_DB.py <config file>
Configuration file: User can specify raw data directory, reference genome, reference polyA annotations, key analysis parameters and other output parameters in the configuration file, saved as a “yaml” file.
# Header
APSeq:
# base directory of raw data (fastq data in zipped gz format) #
baseDir: RawData
# output directory – all the results will be saved to this directory #
outDir: TestAPAOutput
# maximum CPU cores to use #
mc: "10"
# Path to reference bowtie2 index #
ref_genome: /Index/Human_hg38_UCSC/hg38
# Path to reference genome fasta sequence #
ref_fasta: / Index/Human_hg38_UCSC/hg38.fasta
# Path to reference gene bed file #
ref_bed: /Index/Human_hg38_UCSC/hg38.ucsc.refseq.bed
# Path to reference PolyA data (PolyA_DB) bed file – from PolyADB#
ref_polyA: /mnt/data/Index/PolyA_DB3/hg38.PASsignal.bed
# AP block – reads within this distance of an annotated polyA site are considered. Typically, a read’s length #
APAblock: 69
# Merge AP blocks distance –Cluster annotated polyA sites within this distance #
mddb: 0
# Merge features distance – reads within this distance are pooled as one polyA site #
md: 15
# AP site anchor/feature overlap - reads that overlap by at least an anchor distance are assigned to the same polyA peak #
anchor: 1
# control samples: Should be same as zipped (gz) fastq files, excluding file extensions - if raw files are WT1.fastq.gz, WT2.fastq.gz, or WT3.fastq.gz, replace the file extensions #
ct: WT1,WT2,WT3
# treatment samples: Should be same as zipped (gz) fastq files excluding file extensions #
tr: MT1,MT2,MT3
# poverA – fraction of samples required to have a minimum read count rc (option bellow) #
pA: 0.60
# Read cutoff – at least the pA fraction of samples in either of the conditions should have a minimum of rc reads #
rc: 5
# Minimum Reads per polyA site #
M: 1
# file key – output file name head#
fkey: AltPA_Results
Test environment: Ubuntu 16.04.6, Python 3.6.8, bowtie2 v2.2.6, samtools v1.4.1, bedtools v2.25.0, fastp v0.19.4. Default parameters should work for most of the datasets. However, we recommend adjusting the pA, rc, and M parameters to account for data variability and sequencing depth.