For the haplotype-based analyses, we worked on a merged dataset combining the Greenlandic data and the European reference samples. We kept all sites present in both datasets and excluded 52 sites with more than 2% missing data. The resulting merged dataset comprised 135,702 loci and 12,247 individuals, with a total genotyping rate of 0.9995; all loci had a minor allele count of at least 5.
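The per-site quantities mentioned above (missingness and minor allele count) can be illustrated with a minimal Python sketch. It assumes genotypes are coded as 0, 1, or 2 alternate-allele copies with `None` for a missing call. The function names and the toy data are hypothetical and are not part of the pipeline used in this study, which is not specified at this level of detail.

```python
def site_summary(calls):
    """Return (missing_rate, minor_allele_count) for one site.

    `calls` is a list of diploid genotype calls coded 0/1/2,
    with None marking a missing call (an assumed encoding).
    """
    observed = [g for g in calls if g is not None]
    missing_rate = (len(calls) - len(observed)) / len(calls)
    alt_count = sum(observed)
    # The minor allele is whichever allele is rarer among the observed chromosomes.
    mac = min(alt_count, 2 * len(observed) - alt_count)
    return missing_rate, mac


def sites_passing_missingness(genotypes, max_missing=0.02):
    """Keep the sites whose missing-call fraction does not exceed max_missing."""
    return [
        site
        for site, calls in genotypes.items()
        if site_summary(calls)[0] <= max_missing
    ]


# Toy example with 100 individuals per site (made-up data).
genotypes = {
    "rs1": [1] * 10 + [0] * 90,              # no missing calls
    "rs2": [None] * 3 + [0] * 97,            # 3% missing, excluded
    "rs3": [None] + [2] * 3 + [0] * 96,      # 1% missing, kept
}
kept_sites = sites_passing_missingness(genotypes)
```

In this sketch the minor allele count is only computed, not used as a filter, because the text reports it as a property of the merged dataset rather than as an explicit exclusion step.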
The merged Greenlandic-European dataset was split by chromosome and phased without a reference panel using SHAPEIT [34] (v2.r904) with default settings and the HapMap phase II recombination map for hg19.
After merging and phasing, we removed close relatives among the Greenlandic individuals by retaining at most one individual from each pair with a coefficient of relatedness > 0.2. We then split the remaining Greenlanders into two sets based on the results of a K = 2 ADMIXTURE analysis: 1) the unadmixed Greenlanders with >99% inferred Inuit ancestry, and 2) the admixed Greenlanders with > % inferred European ancestry (for additional details, see Data S1). From the second set, we removed seventeen Greenlandic individuals estimated to have >5% African or >7% Asian ancestry in a K = 4 ADMIXTURE analysis that included 1000 Genomes samples from China (CHB), Nigeria (YRI), and the US (CEU). These thresholds were selected to exclude individuals who differed markedly from the majority of other Greenlandic individuals (data not shown) and to avoid having to include any Asian or African reference samples in our fine-scale analyses. We also excluded admixed Greenlandic individuals living in Denmark, as these individuals may be more likely to have Danish ancestry than other European ancestries. This left us with a dataset consisting of 1582 not closely related Greenlanders with European admixture (admixed samples), 181 not closely related unadmixed Greenlanders (Inuit reference samples), and 8303 European reference samples.
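The relatedness filter described above requires that no retained pair exceeds the 0.2 threshold, but the text does not state how the individual to drop from each pair was chosen. The following Python sketch shows one possible approach, a greedy rule that repeatedly removes the individual with the most remaining close relatives. The greedy rule, the function name `prune_relatives`, and the example pairs are assumptions made for illustration only.

```python
from collections import defaultdict


def prune_relatives(pairs, threshold=0.2):
    """Return the set of individuals to remove so that no retained pair
    has a coefficient of relatedness above `threshold`.

    `pairs` is an iterable of (id_a, id_b, relatedness) tuples.
    Greedy heuristic (an assumption, not the study's stated rule):
    drop the individual with the most remaining close relatives first.
    """
    relatives = defaultdict(set)
    for a, b, r in pairs:
        if r > threshold:
            relatives[a].add(b)
            relatives[b].add(a)

    removed = set()
    while True:
        # Individuals that still have at least one close relative retained.
        candidates = sorted(i for i, rel in relatives.items() if rel)
        if not candidates:
            break
        # Ties are broken alphabetically because max() keeps the first maximum.
        worst = max(candidates, key=lambda i: len(relatives[i]))
        for other in relatives[worst]:
            relatives[other].discard(worst)
        relatives[worst] = set()
        removed.add(worst)
    return removed


# Toy example: B is closely related to both A and C; C and D are not close.
example_pairs = [("A", "B", 0.5), ("B", "C", 0.25), ("C", "D", 0.1)]
to_remove = prune_relatives(example_pairs)
```

Removing the most connected individual first tends to retain more samples than dropping one member of each pair arbitrarily, although it is not guaranteed to be optimal.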
Based on the results of a pilot ChromoPainter analysis, we subsequently excluded 28 of the European reference samples because they were significant outliers (z-score > 5) when their total chunk counts were compared with those of the other individuals from their country (not shown). An atypically high number of chunks can be indicative of low data quality. This resulted in a final set of 8275 European reference samples (Figure S1), and thus 8275 + 181 = 8456 reference samples in total, together with 1582 not closely related Greenlanders with European admixture. These data were used to infer ancestry contributions; for details of this analysis, see the Quantification and statistical analysis section and Data S2.
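The outlier criterion above can be sketched in Python as follows. This sketch assumes a leave-one-out z-score (each sample is compared with the mean and standard deviation of the other samples from the same country) and a one-sided test for atypically high chunk counts. Both choices, along with the function name and the toy numbers, are assumptions, since the exact computation is not described in the text.

```python
from statistics import mean, stdev


def chunk_count_outliers(chunk_counts, country, z_cutoff=5.0):
    """Flag samples whose total chunk count is more than `z_cutoff`
    standard deviations above the mean of the other samples from
    the same country (leave-one-out, one-sided; assumed design).

    chunk_counts: dict sample_id -> total chunk count
    country:      dict sample_id -> country label
    """
    by_country = {}
    for sample, count in chunk_counts.items():
        by_country.setdefault(country[sample], []).append((sample, count))

    outliers = set()
    for members in by_country.values():
        for sample, count in members:
            rest = [c for s, c in members if s != sample]
            if len(rest) < 2:
                continue  # too few samples to estimate a standard deviation
            sd = stdev(rest)
            if sd > 0 and (count - mean(rest)) / sd > z_cutoff:
                outliers.add(sample)
    return outliers


# Toy example (made-up chunk counts).
counts = {"s1": 100, "s2": 102, "s3": 98, "s4": 101, "s5": 99, "s6": 200,
          "e1": 50, "e2": 500}
labels = {"s1": "DK", "s2": "DK", "s3": "DK", "s4": "DK", "s5": "DK",
          "s6": "DK", "e1": "SE", "e2": "SE"}
flagged = chunk_count_outliers(counts, labels)
```

Excluding the sample itself from the mean and standard deviation matters for small groups: if the sample were included, its own extreme value would inflate the standard deviation and cap the achievable z-score.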