Dataset. A total of 222 samples are presented here. Of these, 167 Italians and 6 Albanians were specifically selected and genotyped for this project with two versions (1.2 and 1.3) of the Illumina Infinium Omni2.5-8 BeadChip, while 49 Italians and other Europeans were genotyped with the Human660W-Quad BeadChip as part of another study (data file S1) (35). Two separate worldwide datasets were prepared. The FMD included 4852 samples (2, 12, 22, 36–51), of which 1589 were Italian, and 218,725 SNPs genotyped with Illumina arrays; the HDD contained 1651 samples (12, 36, 38, 40, 44, 47, 50, 51), of which 524 were Italian, and 591,217 SNPs genotyped with the Illumina Omni array (Supplementary Materials).

Dataset merging, removal of ambiguous (C/G and A/T) and triallelic markers, exclusion of related individuals, and pruning of SNPs in linkage disequilibrium (LD) were performed using PLINK 1.9 (52). Only autosomal markers were considered.
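
As an illustration of the ambiguous-marker filter only, the following Python sketch flags SNPs whose two alleles are complementary bases (A/T or C/G), since their strand cannot be resolved when merging arrays. It is not the PLINK procedure used in the study; the function names and the `(snp_id, allele1, allele2)` record layout are assumptions made for this example.

```python
# Hypothetical helper, not part of PLINK: flags strand-ambiguous SNPs.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """Return True when the two alleles are complementary (A/T or C/G)."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def keep_unambiguous(snps):
    """Keep (snp_id, allele1, allele2) records that are not A/T or C/G."""
    return [s for s in snps if not is_strand_ambiguous(s[1], s[2])]


if __name__ == "__main__":
    example = [("rs1", "A", "G"), ("rs2", "C", "G"), ("rs3", "T", "A")]
    for snp_id, _, _ in keep_unambiguous(example):
        print(snp_id)
```

The identifiers retained by such a filter could then be written to a text file and passed to PLINK's `--extract` option.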

Haplotype analysis (CP and fS). We investigated patterns of genetic differentiation in Italy by exploring the information provided by SNP-based haplotypes. Phased haplotypes were generated using SHAPEIT 2 (53) with the HapMap b37 genetic map.

CP was used to generate a matrix of recipient individuals “painted” as a combination of donor samples (copying vector). Three runs of CP were performed for each dataset, generating three different outputs: (i) a matrix of all the individuals painted as a combination of all the individuals, for cluster identification and GT analysis; (ii) a matrix of all Italians as a combination of all Italians, for FST analysis; and (iii) a matrix of all the samples as a combination of all the other samples but excluding Italians, for noItaly GT analysis. The median numbers of SNPs per painted fragment were 13.7 and 31.6 for the FMD and HDD, respectively.

Clusters were inferred using fS. After an initial search based on the “greedy” mode, the dendrogram was processed by visual inspection (2, 20) according to the geographical origin of the samples. The robustness of the clusters was assessed by processing the Markov Chain Monte Carlo (MCMC) pairwise coincidence matrix (Supplementary Materials).

Cluster self-copy analysis. Recently admixed individuals were identified as those copying less from members of their own cluster than the amount of cluster self-copying observed for samples with all four grandparents from the same geographic region (Supplementary Materials).
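
The comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact threshold is given in the Supplementary Materials, so here it is assumed to be the lowest self-copy fraction among reference samples (those with all four grandparents from the same region) of each cluster. The function names and the data layout are hypothetical.

```python
def self_copy_fraction(copy_vector, own_cluster):
    """Fraction of the total copying that comes from the sample's own cluster."""
    total = sum(copy_vector.values())
    return copy_vector.get(own_cluster, 0.0) / total


def flag_recently_admixed(samples, reference_ids):
    """Flag samples that self-copy less than their cluster's reference threshold.

    samples: dict mapping sample id -> (cluster name, {donor cluster: amount copied})
    reference_ids: ids of samples with four grandparents from the same region
    Assumption: the threshold is the minimum self-copy fraction among references.
    """
    thresholds = {}
    for sample_id in reference_ids:
        cluster, vector = samples[sample_id]
        fraction = self_copy_fraction(vector, cluster)
        thresholds[cluster] = min(fraction, thresholds.get(cluster, fraction))

    flagged = []
    for sample_id, (cluster, vector) in samples.items():
        if cluster not in thresholds:
            continue  # no reference samples available for this cluster
        if self_copy_fraction(vector, cluster) < thresholds[cluster]:
            flagged.append(sample_id)
    return sorted(flagged)


if __name__ == "__main__":
    toy = {
        "ref1": ("NItaly", {"NItaly": 8.0, "SItaly": 2.0}),
        "ref2": ("NItaly", {"NItaly": 7.0, "SItaly": 3.0}),
        "x": ("NItaly", {"NItaly": 5.0, "SItaly": 5.0}),
        "y": ("NItaly", {"NItaly": 9.0, "SItaly": 1.0}),
    }
    print(flag_recently_admixed(toy, {"ref1", "ref2"}))
```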

FST and TVD within and between Italian clusters. To obtain a comprehensive overview of the genetic diversity in Italy, we estimated the pairwise FST within Italian clusters using smartpca implemented in EIGENSOFT (54). TVD estimates were obtained using the TVD function (11) on the CP chunklength matrix.
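
For orientation, total variation distance between two copying vectors is commonly defined as half the sum of the absolute differences between their normalized entries. The Python sketch below assumes that definition; it is not the TVD function of reference (11), and the function names are chosen for this example only.

```python
def normalize(vector):
    """Scale a copying vector so that its entries sum to 1."""
    total = sum(vector)
    if total <= 0:
        raise ValueError("copying vector must have a positive sum")
    return [value / total for value in vector]


def tvd(vector_a, vector_b):
    """Total variation distance: 0.5 * sum(|a_i - b_i|) on normalized vectors."""
    if len(vector_a) != len(vector_b):
        raise ValueError("copying vectors must list the same donors")
    a, b = normalize(vector_a), normalize(vector_b)
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    # Two toy chunklength vectors over the same two donor groups.
    print(tvd([3.0, 1.0], [1.0, 1.0]))
```

Under this definition the value ranges from 0 (identical copying profiles) to 1 (no shared donors).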

Pairwise genetic diversity among clusters was computed by estimating pairwise FST and TVD metrics on Italian clusters belonging to the same or to different macroareas. In detail, the NItaly macroarea comprised the clusters named NItaly and NCItaly; the SItaly macroarea included the clusters named SItaly, SCItaly, and Sicily; and the Sardinia macroarea included only the Sardinia clusters.

FST estimates among clusters. Pairwise FST estimates among newly generated Italian clusters and originally generated European clusters (Supplementary Materials) were inferred using the smartpca software implemented in the EIGENSOFT package (54). Comparisons between the FST distributions were performed using a Wilcoxon rank sum test in the R programming language environment.

Principal components analysis. Haplotype-based PCA was performed on the CP chunkcount matrix (Supplementary Materials) using the prcomp() function in R. Allele frequency PCA was performed using smartpca implemented in EIGENSOFT (54) after pruning the datasets for LD.

Procrustes analysis. To validate the correlation observed between the haplotype-based PCA (Fig. 1) and the cluster distribution (Fig. 1C), we performed a symmetric Procrustes analysis with 100,000 permutations. In detail, we used the values of the first two PCs of the PCA estimated on the CP chunkcount matrix generated using only Italian individuals for which the place of origin (administrative region) was available. These were compared with the geographic coordinates of the administrative center (“capoluogo di regione”) of the region to which each individual was assigned on the basis of available information (data file S1).

Characterization of the migration landscape (EEMS analysis). We highlighted the spatial patterns of genetic differentiation by EEMS analysis (19). The average pairwise distances between populations were estimated using the bed2diffs tool, and the resulting output was visualized using the Reems package (19).

ADMIXTURE analysis. ADMIXTURE 1.3.0 (55) was run 10 times, each run with a random seed. The results were combined with CLUMPAK (56) using the largeKGreedy algorithm and random input orders with 10,000 repeats. Distruct for many K’s implemented in CLUMPAK (56) was then used to identify the best alignment of CLUMPP results. Results were processed using the R statistical software.

The time and the sources of admixture events (GT and MALDER analyses). Times of admixture events were investigated using the GLOBETROTTERv2 software. GT was run using two approaches, complete and nonlocal (referred to as noItaly; Supplementary Materials), in default modality (2, 11). The two approaches differed in the inclusion or exclusion, respectively, of all the Italian clusters as donors in the CP matrix used as the input file. To improve the precision of the admixture signals, the “null.ind: 1” parameter was set (2). Unclear signals were corrected using the default parameters, and a total of 100 bootstraps were performed. MALDER (57) uses allele frequencies to dissect the time of admixture signals. The best amplitude was identified and used to calculate a Z score (Supplementary Materials). A Z score equal to or lower than 2 identifies amplitude curves that are not significantly different (Supplementary Materials) (58).

Sources for both GT and MALDER were grouped in different ancestries as indicated in the legend of Fig. 3 and fig. S8. The expression [1950 − (g + 1) × 29], where g is the number of generations, was used to convert the GT and MALDER results into years.
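
The conversion expression can be checked with a short Python sketch. The constants 1950 and 29 come from the expression above; the function and constant names are chosen for this example only, and negative results are read here as years BCE without any year-zero adjustment, which is an assumption of this illustration.

```python
REFERENCE_YEAR = 1950    # reference year used in the expression
GENERATION_YEARS = 29    # years per generation used in the expression


def generations_to_year(g):
    """Convert an admixture date in generations (g) to a calendar year."""
    return REFERENCE_YEAR - (g + 1) * GENERATION_YEARS


if __name__ == "__main__":
    for g in (10, 50, 100):
        print(g, generations_to_year(g))
```

For example, g = 50 gives 1950 − 51 × 29 = 471, that is, an admixture event dated to about 471 CE.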
