Principal component analysis, k-means, and ADMIXTURE (dx.doi.org/10.17504/protocols.io.bkwbkxan in protocols.io)

Israel Aguilar-Ordoñez
Fernando Pérez-Villatoro
Humberto García-Ortiz
Francisco Barajas-Olmos
Judith Ballesteros-Villascán
Ram González-Buenfil
Cristobal Fresno
Alejandro Garcíarrubio
Juan Carlos Fernández-López
Hugo Tovar
Enrique Hernández-Lemus
Lorena Orozco
Xavier Soberón
Enrique Morett

The pipeline for running PCA, k-means, and ADMIXTURE on a single dataset is available at https://github.com/jbv2/VCF2PCP. In brief, from the IPVS we kept only our NM samples and the 4 NP individuals from the 1000 Genomes Project (sample IDs: HG01926, HG01938, HG01961, HG02272). Using bcftools v1.9-220-gc65ba41, we kept biallelic SNVs with MAF > 0.05 and removed variants in linkage disequilibrium (r² > 0.85) with the bcftools +prune plugin, using the parameters --window 2000bp --nsites-per-win 1. We converted the VCF files to Eigenstrat format. PCA was performed with smartpca from Eigensoft v6.1.4 [72], requesting numoutevec: 20. We kept eigenvectors with P-value < 0.01 and then recalculated the percentage of variability explained by each eigenvector, taking the sum of the selected eigenvalues as 100%. For the k-means analysis, we used the average silhouette method to determine the optimal number of clusters. For the ADMIXTURE v1.3 [20] analysis, we used the --seed 43 parameter.
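
The two post-PCA calculations described above (rescaling the variance percentages over the retained eigenvectors, and scoring a clustering by its average silhouette width) can be sketched as follows. This is a minimal illustration in plain Python, not the code from the VCF2PCP pipeline: the function names, the Euclidean distance, the zero score for single-member clusters, and all numeric values in the usage note are assumptions made for the example.

```python
import math


def renormalized_variance(eigenvalues, p_values, alpha=0.01):
    """Percent variance per retained eigenvector.

    Eigenvectors with p-value < alpha are kept, and the sum of their
    eigenvalues is treated as 100%. Keys are 1-based eigenvector indices.
    """
    kept = [
        (i, ev)
        for i, (ev, p) in enumerate(zip(eigenvalues, p_values), start=1)
        if p < alpha
    ]
    total = sum(ev for _, ev in kept)
    return {i: 100.0 * ev / total for i, ev in kept}


def average_silhouette(points, labels):
    """Mean silhouette width over all points (Euclidean distance).

    Assumption: points in single-member clusters score 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

As a usage example with made-up numbers, `renormalized_variance([6.0, 3.0, 1.0, 0.5], [1e-10, 1e-4, 0.005, 0.2])` keeps the first three eigenvectors and assigns them 60%, 30%, and 10%. To pick the number of clusters, one would run k-means for each candidate k on the retained eigenvectors, pass the resulting labels to `average_silhouette`, and keep the k with the highest score.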
