The performance of PubChemLite was assessed using datasets previously used to evaluate MetFrag performance: CASMI 2016 [16] and MetFrag Relaunched [26] (hereafter MetFragRL). The CASMI 2016 dataset consisted of 208 compound-MS/MS spectrum pairs. The MetFragRL evaluation sets consisted of four groups of spectra measured under different conditions (datasets EA, EQEx, EQExPlus and UF, with n = 473, 289, 310 and 226, respectively, where n refers to the number of compound-MS/MS spectrum pairs). The calculations performed on the individual datasets are presented in Additional file 3: Table S1 and Figure S1, alongside the previously published results. In total, this corresponded to 1298 (MetFragRL) and 1506 (MetFragRL + CASMI) compound-MS/MS pairs; these totals include duplicates, since some compounds had mass spectra available in both modes and there was some overlap between the different datasets. Calculations performed on this combined set (comparing PubChemLite tiers and CompTox) are presented in Additional file 3: Table S2 and Figure S2. For clarity in the main manuscript, the set of 1506 pairs was de-duplicated to 977 unique compounds by InChIKey First Block, after accounting for multiple tautomeric forms, to eliminate any confusion arising from duplicate spectra or modes. An R script used the MS/MS spectrum record number (the first matching entry where multiple spectra existed) to automatically extract the corresponding MS/MS peaks and save them to the benchmarking file, drawing on the MS/MS spectra provided as supporting information (SI) for the respective studies and downloaded from the journal pages [16, 26]. As all compounds were present in PubChem, additional compound information was filled in using PubChem web services via R functions. The final benchmarking file (hereafter “PCLite Benchmark” set) is available as Additional file 2 and on the ECI GitLab pages, along with all associated code [62].
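The de-duplication step described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the original workflow used R scripts), and the record layout, the field names `spectrum_id` and `inchikey`, and the function names are hypothetical. It assumes standard InChIKeys, in which the first block is the 14 characters before the first hyphen, and it keeps the first matching entry for each first block. The handling of multiple tautomeric forms mentioned in the text is not reproduced here.

```python
def inchikey_first_block(inchikey: str) -> str:
    """Return the first block (connectivity layer hash) of an InChIKey."""
    return inchikey.strip().split("-")[0]


def deduplicate_by_first_block(records):
    """Keep only the first record encountered for each InChIKey first block.

    `records` is an iterable of dicts with an "inchikey" entry
    (a hypothetical layout chosen for this sketch).
    """
    unique = {}
    for record in records:
        block = inchikey_first_block(record["inchikey"])
        # setdefault leaves an existing entry untouched, so the first match wins
        unique.setdefault(block, record)
    # dicts preserve insertion order, so the original ordering is retained
    return list(unique.values())


# Illustrative records, not taken from the benchmark set
records = [
    {"spectrum_id": "S001", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},
    {"spectrum_id": "S002", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},  # same compound, second spectrum
    {"spectrum_id": "S003", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
]

unique_records = deduplicate_by_first_block(records)
print([r["spectrum_id"] for r in unique_records])
```

Grouping on the first block alone collapses entries that differ only in stereochemistry, isotopic labelling or charge layer, which is why it yields fewer unique compounds than grouping on the full InChIKey.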
The PCLite Benchmark set was used to evaluate various versions of PubChemLite (dates: 18/11/2019 [35], 14/01/2020 [36], 22/05/2020 [49], 12/06/2020 [49] and 31/10/2020 [40]) as well as the CompTox Chemicals Dashboard version from 07/03/2019, archived as MetFrag Local CSV (database) files [39, 65]. Files from the most recent CompTox release are not yet available, but have been requested. The “Select Metadata” version of CompTox was used, which contained 857,615 entries, corresponding to 773,561 DTXCID InChIKeys and 773,232 InChIKey First Blocks associated with DTXCIDs (the CompTox “MS-ready” form [66] of information used in MetFrag). All CompTox files from a given release contain the same number of entries and differ only in their metadata content. All queries were run using the exact mass with a 5 ppm error; additional scoring terms and other parameters are detailed in Additional file 3: Table S4 and in the supporting scripts available on the ECI GitLab pages [67].
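To make the 5 ppm error concrete, the following minimal sketch converts a relative mass error in ppm into an absolute mass window around a query mass. It is written in Python for illustration only; the function names and the example mass are hypothetical, and the symmetric window is an assumption of this sketch rather than a description of how MetFrag applies the tolerance internally.

```python
def ppm_window(exact_mass: float, ppm: float = 5.0):
    """Return (low, high) bounds of a symmetric ppm window around exact_mass."""
    # 1 ppm is one millionth of the query mass
    delta = exact_mass * ppm * 1e-6
    return exact_mass - delta, exact_mass + delta


def within_ppm(candidate_mass: float, query_mass: float, ppm: float = 5.0) -> bool:
    """Check whether candidate_mass lies inside the ppm window of query_mass."""
    low, high = ppm_window(query_mass, ppm)
    return low <= candidate_mass <= high


# Hypothetical query mass of 200.0 Da: the half-width is 200.0 * 5e-6 = 0.001 Da
low, high = ppm_window(200.0, ppm=5.0)
print(f"{low:.4f} to {high:.4f}")

print(within_ppm(200.0005, 200.0))  # candidate 0.0005 Da away from the query
print(within_ppm(200.0020, 200.0))  # candidate 0.0020 Da away from the query
```

Because the tolerance is relative, the absolute window widens with mass: the same 5 ppm setting admits candidates within 0.001 Da at 200 Da and within 0.0025 Da at 500 Da.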