Inference of population size histories

Jeffrey P. Spence; Yun S. Song

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Preprint

Inference of population size histories

JS Jeffrey P. Spence

YS Yun S. Song

Last updated date: Mar 30, 2024 Views: 610 Forks: 0

An abbreviated version of this protocol was published in Science Advances in Oct, 2019

Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations

Download PDF

Ask a question

How to cite

Favorite

VCF files for each autosome were downloaded from th one thousand genomes project. They are now available at e.g., https://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz for chromosome 1.

These were then split by population using the population IDs in https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel to produce population-specific VCFs, which we named as “${pop}_chr_${chrom}_hg19.vcf”. We used the following python script to split the original vcfs:

##### start python script

import glob, pandas, numpy, multiprocessing,gzip

csv = pandas.read_csv(

"integrated_call_samples_v3.20130502.ALL.panel",

header=0,

names=['Sample', 'Population', 'super_pop', ‘gender’]

)

pop_labels = {}

pops = []

for row in csv.iterrows():

pop_labels[row[1]["Sample"]] = row[1]["Population"]

pops.append(row[1]["Population"])

pops = set(pops)

def split(fname):

print(fname)

fh = gzip.open(fname, 'r')

chrom_idx = int(fname.split(".")[1].split("_")[0][3:]) - 1

inds = None

ofhs = {}

for p in pops:

ofhs[p] = open(p + "_chr_" + str(chrom_idx+1) + "_hg19.vcf",'w')

for line in fh:

if line[0] == "#":

if line[0:6] == "#CHROM":

la = line.split()

fields = la[0:9]

inds = numpy.array(la[9:])

pops_ordered = numpy.array([pop_labels[ind] for ind in inds])

for p in ofhs:

newline = "\t".join(fields + inds[pops_ordered == p].tolist()) + "\n"

ofhs[p].write(newline)

else:

for p in ofhs:

ofhs[p].write(line)

else:

la = line.split()

fields = la[0:9]

if not(len(la[3]) == 1 and len(la[4]) == 1):

continue

for p in ofhs:

gts = numpy.array(la[9:])[pops_ordered == p]

keep = False

for g in gts:

if g[0] == gts[0][0] and g[2] == gts[0][0]:

continue

else:

keep = True

break

if keep:

newline = "\t".join(fields + gts.tolist()) + "\n"

ofhs[p].write(newline)

fh.close()

for p in ofhs:

ofhs[p].close()

p = multiprocessing.Pool(22)

p.map(split, glob.glob("ALL*.vcf.gz"))

##### end python script

We then used the following python script which calls a bash script (called run_smcpp.bash with source code copied below) which calls smc++. smc++ has changed somewhat in the 5+ years between when these scripts were run and when this protocol was written, so it is important to use smc++ v1.11.1dev (should be this commit). These scripts require bgzip and tabix. They also require the hs37d5 mappability masks to be in the same directory. Those can be downloaded at https://share.eva.mpg.de/index.php/s/ygfMbzwxneoTPZj

The following python script should be called using one representative VCF per population, e.g. using bash expansion, this can be called as

python smcpp_controller.py {ACB,ASW,BEB,CDX,CEU,CHB,CHS,CLM,ESN,FIN,GBR,GIH,GWD,IBS,ITU,JPT,KHV,LWK,MSL,MXL,PEL,PJL,PUR,STU,TSI,YRI}_chr_22_hg19.vcf

##### start smcpp_controller.py script

import sys, subprocess

for fname in sys.argv[1:]:

pop = fname.split("_")[0]

print(pop)

fh = open(fname)

for line in fh:

if "#CHROM" in line:

inds = line.split()[9:]

break

args = [pop] + inds[0:5] + [",".join(inds)]

subprocess.call(["bash", "run_smcpp.bash"]+args)

##### end smcpp_controller.py script

##### start run_smcpp.bash script

pop=$1

dist1=$2

dist2=$3

dist3=$4

dist4=$5

dist5=$6

allInds=$7

#make smcpp files

for chrom in {1..22}

echo "running" $chrom

bgzip ${pop}_chr_${chrom}_hg19.vcf

tabix ${pop}_chr_${chrom}_hg19.vcf.gz

for distinguished in $dist1 $dist2 $dist3 $dist4 $dist5

smc++ vcf2smc --mask hs37d5_chr${chrom}.mask.bed.gz -d ${distinguished} ${distinguished} ${pop}_chr_${chrom}_hg19.vcf.gz ${pop}_${chrom}_${distinguished}.smcpp ${chrom} ${pop}:${allInds} &

done

wait

done

mkdir $pop

#run smcpp

echo running smcpp

smc++ estimate -o ${pop}/ 1.25e-8 ${pop}*.smcpp

#clean up

echo cleaning up

rm ./${pop}*.smcpp

##### end run_smcpp.bash script

The results are output in the smc++ format (as .json files). Plotting was done using “smc++ plot”. See https://github.com/popgenmethods/smcpp .

How to cite：

Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:

Spence, J and Song, Y(2024). Inference of population size histories. Bio-protocol Preprint. bio-protocol.org/prep2625.
Spence, J. P. and Song, Y. S.(2019). Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. Science Advances 5(10). DOI: 10.1126/sciadv.aaw9206

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

1 Q&A

Questions about the mapabilty bed files you have used.

0 Answer 8 Views Nov 27, 2024

This protocol preprint was submitted via the "Request a Protocol" track.

Share your protocol with your peers.

Submit a Preprint Protocol