Genotyping was performed using the Infinium™ H3Africa Consortium Array, which contains ~2.3 million markers. A total of 1158 samples were genotyped and processed in Illumina’s GenomeStudio software (version 2.05) for variant calling, following the COPILOT raw Illumina genotyping quality control (QC) protocol detailed in [38]. Seventy-seven samples with a genotyping call rate below 90% were excluded during the GenomeStudio QC; no further samples were excluded on the basis of sample quality, as the genotyping call rate was 99.99%. Individual-level QC then excluded samples whose recorded sex was discordant with X-chromosome-derived sex, heterozygosity outliers (more than 3 SD from the mean), and genetically identical individuals (identity by descent, pi-hat ~ 1.0) (Supplementary Table 2), retaining 1006 individuals for imputation and downstream analysis. Per-marker QC excluded SNPs with a call rate below 97%, a minor allele frequency < 1%, or a deviation from Hardy–Weinberg equilibrium (P < 10⁻⁸), leaving 1 925 391 autosomal and X-chromosome SNPs. The overall genotyping rate was 99.99%. Quality control was carried out using PLINK v1.90 (www.cog-genomics.org/plink/1.9/).
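The QC thresholds above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the `plink` executable name, the file prefixes, and the use of the sample standard deviation are assumptions made for illustration. The PLINK 1.9 flags `--geno`, `--maf`, and `--hwe` are mapped from the stated thresholds, and the command is only built, not executed.

```python
import statistics


def flag_heterozygosity_outliers(het_rates, n_sd=3.0):
    """Return sample IDs whose heterozygosity lies more than n_sd SDs from the mean.

    het_rates maps a (hypothetical) sample ID to its heterozygosity rate.
    """
    values = list(het_rates.values())
    mean = statistics.mean(values)
    # Sample SD is an assumption; the source does not say which estimator was used.
    sd = statistics.stdev(values)
    return [sid for sid, rate in het_rates.items() if abs(rate - mean) > n_sd * sd]


def marker_qc_command(bfile, out, plink="plink"):
    """Build (but do not run) a PLINK 1.9 argument list for the per-marker filters."""
    return [
        plink, "--bfile", bfile,
        "--geno", "0.03",  # drop SNPs with call rate < 97% (missingness > 3%)
        "--maf", "0.01",   # drop SNPs with minor allele frequency < 1%
        "--hwe", "1e-8",   # drop SNPs with Hardy-Weinberg exact-test P < 1e-8
        "--make-bed", "--out", out,
    ]
```

If PLINK 1.9 is installed, the returned list could be passed to `subprocess.run(cmd, check=True)`. Building the arguments as a list rather than a shell string avoids quoting problems with file prefixes.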
For principal component analysis (PCA) of genotypes, we merged our quality-controlled study dataset with the 1000 Genomes Project Phase 3 version 5 reference panel [39] after extracting the overlapping markers and excluding multi-allelic SNPs. The combined data were further filtered (removing variants with a genotyping rate below 99% or a minor allele frequency below 5%) and pruned (--indep-pairwise 1500 150 0.2), with regions of high linkage disequilibrium excluded, before the principal components were generated in PLINK 2.0 [40]. PCA including our dataset was performed twice: (i) with global populations (CEU: Utah residents with Northern and Western European ancestry, representing European ancestry; CHB: Han Chinese in Beijing, China, and JPT: Japanese in Tokyo, Japan, representing East Asian ancestry; and YRI: Yoruba in Ibadan, Nigeria, representing African ancestry) and (ii) with African continental populations only (ESN: Esan in Nigeria; GWD: Gambian in Western Division; LWK: Luhya in Webuye, Kenya; MSL: Mende in Sierra Leone; YRI: Yoruba in Ibadan, Nigeria). In the continental African PCA plot, our study samples were classified as NG-S (participants enrolled at the South-west Nigeria recruitment site, Lagos) or NG-N (participants enrolled at the North-central (Abuja) and North-west (Zaria) recruitment sites in Nigeria). PCA plots were created in R v4.2.2.
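The filtering, pruning, and PCA steps described above could be expressed as two PLINK 2.0 invocations, sketched here in Python as argument lists. This is an illustration under several assumptions, not the authors' commands: the function name, the `plink2` executable name, the file prefixes, and the number of components (10) are hypothetical, and reading the 99% filter as `--geno 0.01` is an interpretation. The exclusion of high-LD regions is omitted because the source does not give the region list.

```python
def pca_commands(bfile, out, plink2="plink2", n_pcs=10):
    """Build (but do not run) PLINK 2.0 argument lists for LD pruning and PCA.

    n_pcs is an assumption; the source does not state how many components were computed.
    """
    filters = [
        "--geno", "0.01",  # assumed reading of the 99% genotyping-rate filter
        "--maf", "0.05",   # keep variants with minor allele frequency >= 5%
    ]
    prune = [
        plink2, "--bfile", bfile, *filters,
        # window of 1500 variants, step of 150 variants, r^2 threshold 0.2
        "--indep-pairwise", "1500", "150", "0.2",
        "--out", out,
    ]
    pca = [
        plink2, "--bfile", bfile, *filters,
        "--extract", f"{out}.prune.in",  # variants retained by the pruning step
        "--pca", str(n_pcs),
        "--out", out,
    ]
    return prune, pca
```

The pruning command writes the retained variant IDs to `<out>.prune.in`, which the second command then reads, so the two must run in that order.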