Advanced Search
Last updated date: Mar 30, 2024 Views: 610 Forks: 0
VCF files for each autosome were downloaded from th one thousand genomes project. They are now available at e.g., https://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz for chromosome 1.
These were then split by population using the population IDs in https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel to produce population-specific VCFs, which we named as “${pop}_chr_${chrom}_hg19.vcf”. We used the following python script to split the original vcfs:
##### start python script
import glob, pandas, numpy, multiprocessing,gzip
csv = pandas.read_csv(
"integrated_call_samples_v3.20130502.ALL.panel",
header=0,
names=['Sample', 'Population', 'super_pop', ‘gender’]
)
pop_labels = {}
pops = []
for row in csv.iterrows():
pop_labels[row[1]["Sample"]] = row[1]["Population"]
pops.append(row[1]["Population"])
pops = set(pops)
def split(fname):
print(fname)
fh = gzip.open(fname, 'r')
chrom_idx = int(fname.split(".")[1].split("_")[0][3:]) - 1
inds = None
ofhs = {}
for p in pops:
ofhs[p] = open(p + "_chr_" + str(chrom_idx+1) + "_hg19.vcf",'w')
for line in fh:
if line[0] == "#":
if line[0:6] == "#CHROM":
la = line.split()
fields = la[0:9]
inds = numpy.array(la[9:])
pops_ordered = numpy.array([pop_labels[ind] for ind in inds])
for p in ofhs:
newline = "\t".join(fields + inds[pops_ordered == p].tolist()) + "\n"
ofhs[p].write(newline)
else:
for p in ofhs:
ofhs[p].write(line)
else:
la = line.split()
fields = la[0:9]
if not(len(la[3]) == 1 and len(la[4]) == 1):
continue
for p in ofhs:
gts = numpy.array(la[9:])[pops_ordered == p]
keep = False
for g in gts:
if g[0] == gts[0][0] and g[2] == gts[0][0]:
continue
else:
keep = True
break
if keep:
newline = "\t".join(fields + gts.tolist()) + "\n"
ofhs[p].write(newline)
fh.close()
for p in ofhs:
ofhs[p].close()
p = multiprocessing.Pool(22)
p.map(split, glob.glob("ALL*.vcf.gz"))
##### end python script
We then used the following python script which calls a bash script (called run_smcpp.bash with source code copied below) which calls smc++. smc++ has changed somewhat in the 5+ years between when these scripts were run and when this protocol was written, so it is important to use smc++ v1.11.1dev (should be this commit). These scripts require bgzip and tabix. They also require the hs37d5 mappability masks to be in the same directory. Those can be downloaded at https://share.eva.mpg.de/index.php/s/ygfMbzwxneoTPZj
The following python script should be called using one representative VCF per population, e.g. using bash expansion, this can be called as
python smcpp_controller.py {ACB,ASW,BEB,CDX,CEU,CHB,CHS,CLM,ESN,FIN,GBR,GIH,GWD,IBS,ITU,JPT,KHV,LWK,MSL,MXL,PEL,PJL,PUR,STU,TSI,YRI}_chr_22_hg19.vcf
##### start smcpp_controller.py script
import sys, subprocess
for fname in sys.argv[1:]:
pop = fname.split("_")[0]
print(pop)
fh = open(fname)
for line in fh:
if "#CHROM" in line:
inds = line.split()[9:]
break
args = [pop] + inds[0:5] + [",".join(inds)]
subprocess.call(["bash", "run_smcpp.bash"]+args)
##### end smcpp_controller.py script
##### start run_smcpp.bash script
pop=$1
dist1=$2
dist2=$3
dist3=$4
dist4=$5
dist5=$6
allInds=$7
#make smcpp files
for chrom in {1..22}
do
echo "running" $chrom
bgzip ${pop}_chr_${chrom}_hg19.vcf
tabix ${pop}_chr_${chrom}_hg19.vcf.gz
for distinguished in $dist1 $dist2 $dist3 $dist4 $dist5
do
smc++ vcf2smc --mask hs37d5_chr${chrom}.mask.bed.gz -d ${distinguished} ${distinguished} ${pop}_chr_${chrom}_hg19.vcf.gz ${pop}_${chrom}_${distinguished}.smcpp ${chrom} ${pop}:${allInds} &
done
wait
done
mkdir $pop
#run smcpp
echo running smcpp
smc++ estimate -o ${pop}/ 1.25e-8 ${pop}*.smcpp
#clean up
echo cleaning up
rm ./${pop}*.smcpp
##### end run_smcpp.bash script
The results are output in the smc++ format (as .json files). Plotting was done using “smc++ plot”. See https://github.com/popgenmethods/smcpp .
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
Tips for asking effective questions
+ Description
Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.
Share
Bluesky
X
Copy link