Data processing was performed using the Python packages RDKit36 (v2020.09.1.0) and chembl_structure_pipeline (v1.0.0) (https://github.com/chembl/ChEMBL_Structure_Pipeline). Generated SMILES strings were converted to canonical SMILES, InChI, and InChIKey molecular representations by applying the RDKit function Chem.MolFromSmiles followed by Chem.MolToSmiles, Chem.inchi.MolToInchi, or Chem.inchi.MolToInchiKey, respectively. A SMILES string was considered syntactically invalid if Chem.MolFromSmiles, Chem.MolToSmiles, Chem.inchi.MolToInchi, or Chem.inchi.MolToInchiKey failed to return a valid molecular representation. Unique molecular representations, whether canonical SMILES, InChI, or InChIKey, were identified by creating a dictionary from the respective molecular representations with dict.fromkeys(molecular representation), which retains one key per distinct representation. Unique generated molecules were then converted to molblocks with the RDKit function Chem.MolToMolBlock and passed through the ChEMBL structure pipeline to sequentially (1) check structure quality with checker.check_molblock, (2) standardize structures with chembl_structure_pipeline.standardize_molblock, and (3) obtain parent structures by removing isotopes, salts, and solvents with standardizer.get_parent_molblock. Structures with checker penalty scores greater than 5 were removed. The maximum error score (Max_Error_Score) and the error types (Error_Type) were recorded for each remaining entry. Twenty-seven RDKit molecular descriptors (BalabanJ, BertzCT, NumAromaticRings, HallKierAlpha, Kappa1, Chi0, Chi0n, Chi0v, MolLogP, MolMR, MolWt, ExactMolWt, HeavyAtomCount, HeavyAtomMolWt, NHOHCount, NOCount, NumHAcceptors, NumHDonors, NumHeteroatoms, RingCount, FractionCSP3, TPSA, LabuteASA, NumRotatableBonds, NumValenceElectrons, NumSaturatedRings, NumAliphaticRings) were calculated and appended for each remaining entry.
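The conversion, deduplication, and structure-checking steps described above might be sketched as follows. This is a minimal illustration and not the authors' script: the helper names (rdkit_representation, unique_valid, max_penalty, chembl_parent) are hypothetical, and the return shapes assumed for checker.check_molblock (a tuple of (score, message) pairs) and standardizer.get_parent_molblock (a (molblock, exclude_flag) pair) follow the chembl_structure_pipeline README and should be checked against the installed version. RDKit and the ChEMBL pipeline are imported inside the functions that need them, so the pure-Python helpers can be used on their own.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def rdkit_representation(smiles: str, kind: str = "smiles") -> Optional[str]:
    """Return canonical SMILES, InChI, or InChIKey; None if any step fails."""
    from rdkit import Chem  # lazy import: only needed when this helper is called

    mol = Chem.MolFromSmiles(smiles)  # None for unparsable SMILES
    if mol is None:
        return None
    converters = {
        "smiles": Chem.MolToSmiles,  # canonical by default
        "inchi": Chem.inchi.MolToInchi,
        "inchikey": Chem.inchi.MolToInchiKey,
    }
    # An empty string from a converter is treated as a failed conversion.
    return converters[kind](mol) or None


def unique_valid(
    smiles_list: Iterable[str],
    to_repr: Callable[[str], Optional[str]],
) -> Tuple[List[str], int]:
    """Return (unique representations in first-seen order, number of valid inputs)."""
    valid = [r for r in (to_repr(s) for s in smiles_list) if r]
    # dict.fromkeys keeps one key per distinct value and preserves order.
    return list(dict.fromkeys(valid)), len(valid)


def max_penalty(issues: Sequence[Tuple[int, str]]) -> int:
    """Largest checker penalty score, or 0 when no issues were reported."""
    return max((score for score, _ in issues), default=0)


def chembl_parent(smiles: str, threshold: int = 5):
    """Check, standardize, and strip a structure; None if the penalty exceeds threshold."""
    from rdkit import Chem
    from chembl_structure_pipeline import checker, standardizer

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    molblock = Chem.MolToMolBlock(mol)
    issues = checker.check_molblock(molblock)  # assumed: ((score, message), ...)
    score = max_penalty(issues)
    if score > threshold:  # the text removes structures scoring more than 5
        return None
    std_molblock = standardizer.standardize_molblock(molblock)
    parent_molblock, _exclude = standardizer.get_parent_molblock(std_molblock)
    return {
        "parent_molblock": parent_molblock,
        "Max_Error_Score": score,
        "Error_Type": [message for _, message in issues],
    }
```

With RDKit installed, a call such as unique_valid(generated_smiles, lambda s: rdkit_representation(s, "inchikey")) would give the unique InChIKeys and the count of valid strings, where generated_smiles is a placeholder for the list of generated SMILES. Passing the converter as an argument keeps the deduplication logic identical across the three representations.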
Natural product-likeness scores (NP_score)42 for each generated molecule were calculated using npscorer (https://github.com/rdkit/rdkit/tree/master/Contrib/NP_Score). Natural product pathway (pathway), superclass (superclass), and class (class_type) classifications were assigned using the NPClassifier API (https://npclassifier.ucsd.edu/)43. Queries for which NPClassifier returned no output were assigned the value “none”. The percentages of the generated database assigned the value “none” were 11.6% for pathway, 40.0% for superclass, and 51.1% for class.
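The “none” assignment and the per-level percentages can be illustrated with a short sketch. It assumes each molecule's classification has already been retrieved and stored as a dictionary keyed by pathway, superclass, and class_type; that record layout and the function names fill_missing and percent_none are hypothetical, and no call to the NPClassifier API is made here.

```python
from typing import Dict, Iterable, Mapping, Optional

LEVELS = ("pathway", "superclass", "class_type")


def fill_missing(record: Mapping[str, Optional[str]]) -> Dict[str, str]:
    """Replace absent or empty classifications with the string 'none'."""
    return {level: (record.get(level) or "none") for level in LEVELS}


def percent_none(records: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, float]:
    """Percentage of records labelled 'none' at each classification level."""
    filled = [fill_missing(r) for r in records]
    if not filled:  # avoid division by zero on an empty database
        return {level: 0.0 for level in LEVELS}
    return {
        level: 100.0 * sum(r[level] == "none" for r in filled) / len(filled)
        for level in LEVELS
    }
```

Because the classification levels are hierarchical, the fraction of “none” values would be expected to grow from pathway to class, as it does in the reported figures.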