Advanced Search
Published: Mar 30, 2024 Views: 610
VCF files for each autosome were downloaded from th one thousand genomes project. They are now available at e.g., https://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz for chromosome 1.
These were then split by population using the population IDs in https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel to produce population-specific VCFs, which we named as “${pop}_chr_${chrom}_hg19.vcf”. We used the following python script to split the original vcfs:
##### start python script
import glob, pandas, numpy, multiprocessing,gzip
csv = pandas.read_csv(
"integrated_call_samples_v3.20130502.ALL.panel",
header=0,
names=['Sample', 'Population', 'super_pop', ‘gender’]
)
pop_labels = {}
pops = []
for row in csv.iterrows():
pop_labels[row[1]["Sample"]] = row[1]["Population"]
pops.append(row[1]["Population"])
pops = set(pops)
def split(fname):
print(fname)
fh = gzip.open(fname, 'r')
chrom_idx = int(fname.split(".")[1].split("_")[0][3:]) - 1
inds = None
ofhs = {}
for p in pops:
ofhs[p] = open(p + "_chr_" + str(chrom_idx+1) + "_hg19.vcf",'w')
for line in fh:
if line[0] == "#":
if line[0:6] == "#CHROM":
la = line.split()
fields = la[0:9]
inds = numpy.array(la[9:])
pops_ordered = numpy.array([pop_labels[ind] for ind in inds])
for p in ofhs:
newline = "\t".join(fields + inds[pops_ordered == p].tolist()) + "\n"
ofhs[p].write(newline)
else:
for p in ofhs:
ofhs[p].write(line)
else:
la = line.split()
fields = la[0:9]
if not(len(la[3]) == 1 and len(la[4]) == 1):
continue
for p in ofhs:
gts = numpy.array(la[9:])[pops_ordered == p]
keep = False
for g in gts:
if g[0] == gts[0][0] and g[2] == gts[0][0]:
continue
else:
keep = True
break
if keep:
newline = "\t".join(fields + gts.tolist()) + "\n"
ofhs[p].write(newline)
fh.close()
for p in ofhs:
ofhs[p].close()
p = multiprocessing.Pool(22)
p.map(split, glob.glob("ALL*.vcf.gz"))
##### end python script
We then used the following python script which calls a bash script (called run_smcpp.bash with source code copied below) which calls smc++. smc++ has changed somewhat in the 5+ years between when these scripts were run and when this protocol was written, so it is important to use smc++ v1.11.1dev (should be this commit). These scripts require bgzip and tabix. They also require the hs37d5 mappability masks to be in the same directory. Those can be downloaded at https://share.eva.mpg.de/index.php/s/ygfMbzwxneoTPZj
The following python script should be called using one representative VCF per population, e.g. using bash expansion, this can be called as
python smcpp_controller.py {ACB,ASW,BEB,CDX,CEU,CHB,CHS,CLM,ESN,FIN,GBR,GIH,GWD,IBS,ITU,JPT,KHV,LWK,MSL,MXL,PEL,PJL,PUR,STU,TSI,YRI}_chr_22_hg19.vcf
##### start smcpp_controller.py script
import sys, subprocess
for fname in sys.argv[1:]:
pop = fname.split("_")[0]
print(pop)
fh = open(fname)
for line in fh:
if "#CHROM" in line:
inds = line.split()[9:]
break
args = [pop] + inds[0:5] + [",".join(inds)]
subprocess.call(["bash", "run_smcpp.bash"]+args)
##### end smcpp_controller.py script
##### start run_smcpp.bash script
pop=$1
dist1=$2
dist2=$3
dist3=$4
dist4=$5
dist5=$6
allInds=$7
#make smcpp files
for chrom in {1..22}
do
echo "running" $chrom
bgzip ${pop}_chr_${chrom}_hg19.vcf
tabix ${pop}_chr_${chrom}_hg19.vcf.gz
for distinguished in $dist1 $dist2 $dist3 $dist4 $dist5
do
smc++ vcf2smc --mask hs37d5_chr${chrom}.mask.bed.gz -d ${distinguished} ${distinguished} ${pop}_chr_${chrom}_hg19.vcf.gz ${pop}_${chrom}_${distinguished}.smcpp ${chrom} ${pop}:${allInds} &
done
wait
done
mkdir $pop
#run smcpp
echo running smcpp
smc++ estimate -o ${pop}/ 1.25e-8 ${pop}*.smcpp
#clean up
echo cleaning up
rm ./${pop}*.smcpp
##### end run_smcpp.bash script
The results are output in the smc++ format (as .json files). Plotting was done using “smc++ plot”. See https://github.com/popgenmethods/smcpp .
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
Share
Bluesky
X
Copy link