# Also in the Article

Gene catalog construction and gene annotation
This protocol is extracted from research article:
Characterization of the human skin resistome and identification of two microbiota cutotypes
Microbiome, Feb 17, 2021;

Procedure

To construct the skin microbiome gene catalog, sequencing reads from this study as well as from HMP were processed (quality control, removal of human sequences, assembling, gene prediction) using the pipeline shown in Supplementary Fig. 1. SOAPnuke [62] was used for quality control. SOAPaligner2 [63] was for identifying and removing human sequences if they shared > 95% similarity with the human genome reference sequence (hg19) [11]. Consistent with previous findings, on average 80% reads were from human origin instead of microorganisms (Supplementary Fig. 2b). High-quality reads were used for de novo assembly via SPAdes (version 3.13.0) [64], which generated the initial assembly results based on different k-mer sizes (k = 21, 33, 55, 77,99). Ab initio gene identification was performed for all assembled scaffolds by MetaGeneMark (version 3.26) [65]. These predicted genes were then clustered at the nucleotide level by CD-HIT (version 4.5.4), CD-HIT parameters are as follows: - G 0 - M 90000 - R 0 - t 0 - C 0.95 - as 0.90 [66], genes sharing greater than 90% overlap and greater than 95% identity were treated as redundancies. Thus, we obtained a two cohorts non-redundant gene catalog (2CGC) including 13,324,649 genes. To further ensure the integrity of the gene catalog, we did the following: first, sequence alignment was carried out between 2CGC and National Center for Biotechnology Information non-redundant nucleotide (NCBI-NT, downloaded at Aug. 2018): 931 genera genomes (including 2,761 prokaryotes, 112 fungi, 479 viruses)—were identified to be existing in 2CGC (Table S2); we then downloaded the genomes or draft genomes of these microbes and used MetaGeneMark to predict the coding regions; these predicted genes were later pooled, and the software CD-HIT was used to remove the redundant genes. Thus, we got 7,496,818 non-redundant genes, which we refer to as the sequenced gene catalog (SGC). Finally, the gene catalogs based on 2CGC and SGC were combined using CD-HIT. Genes existing in at least ten samples were selected to form the final iHSMGC, which comprised 10,930,638 genes.

To evaluate the genome integrity of a single microbe in iHSMGC, we constructed draft microbial reference genomes of 5409 bacteria, 2023 viruses, and 158 fungi (https://ftp.ncbi.nlm.nih.gov/genomes/) and sequenced alignment iHSMGC with the database. The definite means were as follows: (1) predicting the coding sequence (CDS) of genomes and (2) map iHSMGC with genome CDS using the BWA MEM method (default parameter). The coverage of each genomic CDS region was obtained.

Taxonomic classification of genes was performed based on the National Center for Biotechnology Information non-redundant nucleotide (NCBI-NT, downloaded at Aug. 2018) database. We aligned about 11 million genes of iHSMGC onto the NCBI-NT using BLASTN (v2.7.1, default parameters except that -evalue 1e-10 outfmt 6 -word_size 16). At least 70% alignment coverage of each gene was required. For multiple best-hits (from NCBI-NT database) mapping for the same gene with the same %identity, e value and bit score, we have used the following strategy:

We performed statistics on multiple best-hits (from NCBI-NT database) mapping for the same gene, including the number of annotated species present, the number of occurrences of each annotated species, and the average similarity of the same species. After completion of the statistics, the species annotation with the highest frequency and the highest average similarity was defined as the annotation of the gene. In case that different species for a single gene ranked the same in the statistics, we have chosen the species annotation that ordered first (i.e., the order of blast hits and e value). Accordingly, 95% identity was used as the critical value for species assignment, 85% identity was used as the critical value for genus assignment, and 65% for phylum assignment [6]. The 3.97 million genes of the gene catalog were annotated taxonomically.

We aligned putative amino acid sequences, which translated from the iHSMGC, against the proteins or domains in KEGG databases (release 84.0, genes from animals or plants were excluded) using BLASTP (v2.7.1, default parameters except that -outfmt 6 -evalue 1e-6). At least 30% alignment coverage of each gene was required. Each protein was assigned to a KEGG orthologue (KO) based on the best-hit gene in the database. Using this approach, 6.42 million of the genes in the combined gene catalog could be assigned a KO.

The high-quality reads from each sample were aligned against the gene catalog by SOAP2.21 with the criterion of identity > 90% [63]. In our sequence-based profiling analysis, the alignments that met one of the following criteria as previously described could be accepted [67]: (i) an entire of a paired-end read can be mapped onto a gene with the correct insert-size and (ii) only when the one end of paired-read was mapped outside the genic region; the other end of reads can be mapped onto the end of a gene. In both cases, the mapped read was counted as one copy. The formula used in this study for calculating gene relative abundance is similar to RPKM/FPKM (reads per kilobase of exon model per million mapped reads/fragments per kilobase of exon model per million mapped fragments) value. Accordingly, for any sample 푆, we calculated the abundance as follows:

Step 1: Calculation of the copy number of each gene:

Step 2: Calculation of the relative abundance of gene i:

ai: The relative abundance of gene i in sample S

Li: The length of gene i

xi: The times which gene i can be detected in sample S (the number of mapped reads)

bi: The copy number of gene i in the sequenced data from S.

j: The iHSMGC gene number.

The value of bi standardizes the effect of gene length in Step 1. The value of $bi∑jbi$ standardizes the effect of sequencing depth in Step 2.

The relative abundances of phyla, genera, species, and KOs were calculated from the relative abundance of their respective genes using previously published methods [68]. For the species profile, we used the phylogenetic assignment of each gene from the original gene catalog and summed the relative abundance of genes from the same species to generate the abundance of that species. The phyla, genera, and KO profile were constructed using the same methods.

We used a rarefaction curve to assess the gene richness in our cohorts. For each given number of samples, we performed random sampling 100 times in the cohort with replacement. Moreover, we estimated the total number of genes that could be identified from these samples with the Chao2 index [69].

Antibiotic resistance genes (ARGs) were identified using the Resistance Gene Identifier (RGI, v4.2.2) with default parameters and the CARD database (The Comprehensive Antibiotic Resistance Database, v3.0.7) [70]. DIAMOND was utilized for alignment [71]. In order to identify the species origins of drug resistance genes, the similarity of the predicted ARG segments to known species was estimated by aligning the predicted ARGs to the NCBI-NT using BLASTN (v2.7.1, default parameters except that -evalue 1e-10 outfmt 6 -word_size 16), and identified genes had an alignment coverage greater than 70%.

Note: The content above has been extracted from a research article, so it may not display correctly.

Q&A