The gene_list_generation function assembles a list of genes based on the patient’s phenotype and overlaps it with the names of genes that span SVs. User-provided, phenotype-based keywords are used to generate a gene list from the following databases: ClinVar [35], OMIM (https://omim.org/), GTR [36], and NCBI’s Gene database (www.ncbi.nlm.nih.gov/gene). The input to the function is a term, which can be provided as a single term (method = “Single”), a vector of terms (method = “Multiple”), or a text file (method = “Text”). The output can be a dataframe or text.
The rentrez [37] and VarfromPDB [38] R-language packages are used to extract data related to each of the user-provided phenotypic terms from the individual databases. For the Gene database, rentrez provides the Entrez IDs associated with each gene, which are converted in nanotatoR to gene symbols using org.Hs.eg.db [37], a Bioconductor package. For OMIM, rentrez provides the OMIM record IDs, which are used to extract the corresponding disease-associated genes from the OMIM ID-to-gene ID conversion dataset (mim2gene.txt). For GTR, rentrez extracts the GTR record IDs, which are then used to extract the corresponding gene symbols from the downloaded GTR database. For ClinVar, VarfromPDB is used to extract genes corresponding to the input term. All genes to which the query keyword is attached are extracted, irrespective of their clinical significance; genes of clinical significance (i.e., those for which Pathogenic/Likely Pathogenic variants are reported) are additionally reported in a separate column.
The user also has the option to download the ClinVar and GTR databases by setting downloadClinvar = TRUE and downloadGTR = TRUE, which may improve run times. The downloaded databases can either be kept (removeClinvar = FALSE and removeGTR = FALSE) or deleted after the analysis is completed (removeClinvar = TRUE and removeGTR = TRUE).
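The three input modes described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not nanotatoR's R implementation: the function name normalize_terms is hypothetical, and the assumption that a “Text” input file holds one term per line is ours rather than something stated in the text.

```python
from pathlib import Path


def normalize_terms(term, method):
    """Return a clean list of phenotype terms for the given input method.

    Hypothetical helper: mirrors the "Single" / "Multiple" / "Text" modes
    described in the text, not the actual nanotatoR code.
    """
    if method == "Single":
        # One keyword supplied directly as a string.
        return [term.strip()]
    if method == "Multiple":
        # A collection of keywords; blank entries are dropped.
        return [t.strip() for t in term if t.strip()]
    if method == "Text":
        # Assumption: the file lists one term per line.
        lines = Path(term).read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    raise ValueError(f"Unknown method: {method!r}")
```

Reducing every input mode to the same list of terms means that the per-database queries only need to handle one input shape.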
The output is provided in CSV format with three columns: “Genes”, “Terms”, and “ClinicalSignificance”. The “Terms” column contains the list of terms associated with each gene, together with the database from which each association was derived. The “ClinicalSignificance” column contains genes that have clinical significance (Pathogenic/Likely Pathogenic variants) for the associated term, as derived from the ClinVar database. The output of gene_list_generation serves as input for the subsequent variant filtration step.
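The three-column layout can be sketched as follows. This is an illustrative Python example under stated assumptions, not the package's own code: the function names, the “term (database)” label format, the “; ” separator, and the choice to repeat the gene symbol in “ClinicalSignificance” (leaving the cell empty otherwise) are all hypothetical.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["Genes", "Terms", "ClinicalSignificance"]


def build_gene_table(hits, pathogenic_genes):
    """Collapse (gene, term, database) hits into one row per gene.

    pathogenic_genes: gene symbols with Pathogenic/Likely Pathogenic
    ClinVar variants for the queried term.
    """
    terms_by_gene = defaultdict(list)
    for gene, term, database in hits:
        label = f"{term} ({database})"  # assumed label format
        if label not in terms_by_gene[gene]:
            terms_by_gene[gene].append(label)

    rows = []
    for gene in sorted(terms_by_gene):
        rows.append({
            "Genes": gene,
            "Terms": "; ".join(terms_by_gene[gene]),
            # Assumption: the cell repeats the gene symbol when significant.
            "ClinicalSignificance": gene if gene in pathogenic_genes else "",
        })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with the three expected columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping one row per gene, with all contributing terms and databases merged into a single cell, makes the table straightforward to join against the genes overlapping each SV.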