Systems metabolomics using interpretable learning and evolution (SMILE)

Chengyuan Sha
Miroslava Cuperlovic-Culf
Ting Hu

We propose a computational framework for metabolomics data analysis, Systems Metabolomics using Interpretable Learning and Evolution (SMILE). SMILE uses an evolutionary algorithm for learning interpretable predictive models, provides explanations of its decision-making, and identifies key metabolites and their interactions associated with a complex trait. In order to benefit a wider research community, we also developed a web application for SMILE with a graphical user interface, where researchers can perform interpretable machine learning analysis and visualize the results on their own metabolomic data. The source code of SMILE and the metabolomic data used in this study are publicly available at https://github.com/MIB-Lab/SMILE, detailed documentation of the functions in SMILE is available at https://smile-mib.rtfd.io, and its web application is published at https://smile-mib.cs.queensu.ca.

In the following subsections, we describe the core learning algorithm in SMILE, discuss the metabolite importance and interaction assessment approach, and demonstrate the use of the SMILE web application.

Evolutionary algorithms define a collection of meta-heuristic optimization and modelling algorithms inspired by natural evolution [28]. An evolutionary algorithm maintains a population of diverse candidate solutions to a problem. An initial set of candidate solutions is often generated randomly. Each new generation is produced by probabilistically selecting better solutions for reproduction and introducing small stochastic changes using biologically inspired operators such as mutation and crossover. Evolutionary computing has been successfully applied to machine learning problems, where it can automatically derive a symbolic predictive model. Such a variant of evolutionary algorithms was proposed as genetic programming [29], and has been used to solve classification and regression problems.

The evolutionary algorithm we used in this research is linear genetic programming (LGP) [30, 31]. LGP represents each candidate predictive model in the evolutionary population as an imperative program. The fitness of a predictive model is defined as its classification accuracy. A population of diverse candidate models is initialized randomly and gradually improves in fitness over a large number of generations. After evolution, the model with the highest fitness score is retained as the best evolved model.

Similar to an imperative program, an LGP model consists of several instructions. Each instruction is either an assignment or a conditional statement. An assignment statement has three registers, i.e., one return register and two operand registers. For instance, in the LGP program shown in Fig. 1, instruction 1 assigns the value of r[8] minus a constant 4 to r[1]. The instructions are executed sequentially. The conditional if statement controls the program flow: if the condition is true, the subsequent instruction is executed; otherwise, it is skipped. In the case of nested if statements, all conditions need to be true for the subsequent instruction to be executed. For example, in Fig. 1, line 5 is executed only if the conditions in line 3 and line 4 are both true. Register r[0] is the designated output register, and its final value after execution is projected through a Sigmoid function to classify a sample as either diseased or healthy.
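The execution semantics described above can be illustrated with a small interpreter. This is a simplified sketch, not SMILE's actual implementation: the 4-tuple instruction encoding, register layout, and operator set are assumptions made for illustration, and constants (such as the 4 in Fig. 1) are omitted for brevity.

```python
import math

def run_lgp(program, inputs, n_calc=6):
    """Execute a linear LGP program (illustrative sketch).

    r[0] is the output register, r[1]..r[n_calc-1] are calculation
    registers, and the input features occupy r[n_calc] onwards.
    Instructions are 4-tuples: ("if", cmp, a, b) for branches and
    (dst, op, a, b) for assignments.
    """
    r = [0.0] * n_calc + [float(v) for v in inputs]
    i = 0
    while i < len(program):
        dst, op, a, b = program[i]
        if dst == "if":
            cond = r[a] > r[b] if op == ">" else r[a] < r[b]
            if cond:
                i += 1                  # fall through to the next instruction
            else:
                i += 1                  # skip the rest of the nested-if chain...
                while i < len(program) and program[i][0] == "if":
                    i += 1
                i += 1                  # ...and the guarded instruction itself
        else:
            x, y = r[a], r[b]
            r[dst] = {"+": x + y, "-": x - y, "*": x * y,
                      "/": x / y if y != 0 else 1.0}[op]
            i += 1
    # the final value of r[0] is squashed by a sigmoid for classification
    return 1.0 / (1.0 + math.exp(-r[0]))
```

Note how a false condition skips past any remaining nested if instructions as well as the guarded instruction, matching the all-conditions-must-hold rule described above.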

Representation of an LGP program. This example program has seven instructions, which will be executed in a sequential order. An instruction can be an assignment statement or an if statement. Registers are used to store input variables and to perform computation. r[1] to r[5] are calculation registers and r[6] to r[10] are input registers. Register r[0] is the designated output register and its final value after the execution of all instructions will be the output of this program

Note that not all instructions modify the final value stored in r[0]. We define an effective instruction as one that contributes to the final output; all others, e.g., line 2 and line 6 in Fig. 1, are non-effective.

A register stores the value of a variable. There are two types of variables in our LGP programs, the input variables and the calculation variables. Input variables are predictive features, i.e., metabolite concentrations, in this work. A calculation variable is used as a buffer to enhance computation capacity. The designated output register r[0] is a special calculation register. Constants are chosen from a user-defined interval. Furthermore, a return register, i.e., the one on the left side of an assignment, can only be a calculation register. In this way, our method inherently prevents overriding of the input feature values.

In each generation, parent models are chosen using tournament selection: a randomly chosen set of models compete, and the fittest two are picked to reproduce. Genetic operators, including crossover, macro-mutation, and micro-mutation, are then applied to the selected parents with certain probabilities. Crossover combines the genetic information of two parents to generate two new offspring: two crossover points are picked randomly in each parent model, and the instruction segments defined by these points are swapped between the two parents. Macro-mutation inserts or deletes an instruction in a model. Micro-mutation randomly picks an instruction in a model and changes either a register or the operation in that instruction.
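The three operators can be sketched as follows. The instruction encoding and helper names are illustrative assumptions, not SMILE's code; application probabilities are left to the caller.

```python
import random

def two_point_crossover(p1, p2):
    """Swap the instruction segments between two random cut points."""
    a1, b1 = sorted(random.sample(range(len(p1) + 1), 2))
    a2, b2 = sorted(random.sample(range(len(p2) + 1), 2))
    c1 = p1[:a1] + p2[a2:b2] + p1[b1:]
    c2 = p2[:a2] + p1[a1:b1] + p2[b2:]
    return c1, c2

def macro_mutation(prog, random_instruction):
    """Insert a random instruction or delete an existing one."""
    prog = list(prog)
    if len(prog) > 1 and random.random() < 0.5:
        del prog[random.randrange(len(prog))]
    else:
        prog.insert(random.randrange(len(prog) + 1), random_instruction())
    return prog

def micro_mutation(prog, n_registers, ops=("+", "-", "*", "/")):
    """Change one register or the operator of a randomly chosen instruction."""
    prog = list(prog)
    i = random.randrange(len(prog))
    dst, op, a, b = prog[i]
    field = random.choice(["op", "a", "b"])
    if field == "op":
        prog[i] = (dst, random.choice(ops), a, b)
    elif field == "a":
        prog[i] = (dst, op, random.randrange(n_registers), b)
    else:
        prog[i] = (dst, op, a, random.randrange(n_registers))
    return prog
```

Crossover conserves the total number of instructions across the two offspring, macro-mutation changes a program's length by one, and micro-mutation leaves the length unchanged.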

Then, the two new offspring replace the worst two models in the tournament, and their fitness values are computed. In each run of this evolutionary algorithm, this process is repeated until the limit of the number of generations is reached. The model with the highest fitness score will be saved as a result of evolution. A flowchart of our LGP algorithm is shown in Fig. 2.

LGP algorithm flowchart. After initialization, this evolutionary algorithm repeats the processes of parent selection, mutation, crossover, and replacement, until the generation limit has been reached

Due to the stochastic nature of evolutionary algorithms, each run may yield a different best model. We collected 1000 independent runs of this LGP algorithm. The main parameters used are shown in Table 1. We randomly partitioned the data into a training set (80%) and a testing set (20%) and used a different random seed for each independent run of the algorithm. The fitness value of a model is computed as its training classification accuracy. To prevent overfitting and to reduce the computational overhead of fitness calculation, we used random subsampling: each time the fitness of an individual model was computed, 50–100% of the training set was sampled without replacement. The final best evolved model of each run is then evaluated using the testing set. The testing accuracy and other prediction performance metrics are thus computed using unseen testing samples unique to each evolved best model.
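The subsampled fitness computation can be sketched as below; the function and parameter names are hypothetical, not SMILE's API.

```python
import random

def subsampled_accuracy(model, X_train, y_train, low=0.5, high=1.0):
    """Fitness of one model: classification accuracy on a random 50-100%
    subsample of the training set, drawn without replacement."""
    n = len(X_train)
    k = max(1, random.randint(int(low * n), int(high * n)))
    idx = random.sample(range(n), k)          # without replacement
    correct = sum(model(X_train[i]) == y_train[i] for i in idx)
    return correct / k
```

Drawing a fresh subsample at every fitness evaluation means a model is rewarded for accuracy that generalizes across subsets of the training data rather than for memorizing one fixed sample.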

Parameter configuration in the LGP algorithm

Note that input features, i.e., metabolite concentrations, are stored in registers in an LGP program. In the initialization of the program population, some input registers may be chosen in an instruction of a program, and some may not. In addition, an input register may be mutated or lost as a result of the mutation and crossover operations. An input register may also become ineffective when its value does not contribute to the calculation of the final value of r[0]. For instance, in Fig. 1, removing input register r[7] does not alter the final value of r[0]; thus it is considered an ineffective register, and the feature it represents an ineffective feature. Therefore, the selection of effective metabolite features is embedded in the LGP algorithm and co-evolved with the predictive models.
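A conservative backward pass over a program suffices to identify its effective instructions, and hence its effective registers. The sketch below assumes the same 4-tuple instruction encoding used in the earlier interpreter sketch and treats a branch as effective whenever it guards an effective instruction.

```python
def effective_instructions(program, output_reg=0):
    """Mark instructions whose result can reach the output register.

    Walks the program backwards, tracking the set of registers that still
    influence r[output_reg]. Input registers never appearing in an
    effective instruction correspond to ineffective features.
    """
    needed = {output_reg}
    effective = [False] * len(program)
    guard_pending = False          # does an effective instruction follow?
    for i in range(len(program) - 1, -1, -1):
        dst, op, a, b = program[i]
        if dst == "if":
            # a branch matters only if it guards an effective instruction
            if guard_pending:
                effective[i] = True
                needed |= {a, b}
            continue
        if dst in needed:
            effective[i] = True
            needed.discard(dst)    # this write satisfies the demand...
            needed |= {a, b}       # ...and creates demand for its operands
            guard_pending = True
        else:
            guard_pending = False
    return effective
```

For example, an instruction writing to a calculation register that is never read again before r[0] is produced would be marked `False`, mirroring lines 2 and 6 of Fig. 1.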

We can rank individual features based on their occurrence frequencies in the collected 1000 best evolved models. This ranking provides a means to assess feature importance, i.e., a metabolite feature that appears frequently in the best evolved models may have a strong influence on explaining the prediction of the disease status.
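Counting how often each feature appears among the collected best evolved models might look like the following sketch, where each model is summarized as the set of its effective feature names (an assumed representation):

```python
from collections import Counter

def rank_features(models):
    """models: one set of effective feature names per best evolved model.
    Returns (feature, occurrence count) pairs, most frequent first."""
    counts = Counter()
    for feats in models:
        counts.update(set(feats))   # count each feature at most once per model
    return counts.most_common()
```

Counting a feature at most once per model makes the score an occurrence frequency across runs rather than a raw usage count within programs.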

In addition, if two features tend to co-occur frequently in the same best evolved model, they may have a strong synergistic interaction effect associated with the disease. We calculated this co-occurrence frequency for all pairs of features. Subsequently, we can construct a metabolite synergy network by including the top metabolite pairs that show the strongest synergistic interactions: each of these most frequently co-occurring metabolite pairs is represented as an edge connecting its two metabolites. Such a network can help us visualize a large collection of pairwise feature interactions and identify important metabolites that interact with many others.
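The pairwise co-occurrence counts and the resulting network edges can be computed in the same spirit as the feature ranking above (again taking each model as a set of effective feature names):

```python
from collections import Counter
from itertools import combinations

def co_occurrence(models):
    """Count, for every unordered feature pair, the number of best evolved
    models in which both features appear together."""
    pair_counts = Counter()
    for feats in models:
        for pair in combinations(sorted(set(feats)), 2):
            pair_counts[pair] += 1
    return pair_counts

def top_edges(pair_counts, k=20):
    """The k most frequently co-occurring pairs become the network edges;
    each pair's count can serve as the edge weight."""
    return pair_counts.most_common(k)
```

Sorting each feature set before forming pairs ensures that (A, B) and (B, A) are counted as the same unordered edge.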

To facilitate wider adoption of our proposed approach, we have published the full source code implementing our algorithm. For robust prediction results and a comprehensive feature analysis, we recommend collecting a large number, e.g., 1000, of independent runs of the LGP algorithm, which can in turn require substantial computational power.

For the implementation and analysis included in this study, we used a large-scale high-performance computing cluster, Graham, from Compute Canada. We ran an array of 1000 jobs in parallel. Each job (an independent run of SMILE) took 8–10 hours and used up to 500 MB of memory on one CPU core (Intel Xeon E5-2667 v4, 3.2 GHz).

After an individual job is completed, the result can be saved by calling the save_model() method, which uses the Python pickle module for object serialization. This generates a .pkl result file that can later be uploaded to the web interface for interpretation and visualization. The web application also requires the original dataset as a .csv file, formatted so that rows are samples and columns are features (metabolite concentrations). The header row lists the metabolite names, and the class labels are stored in a column named “category”. Users can check for formatting errors using an automated Python script on SMILE’s GitHub page. Finally, users can upload the .pkl and .csv files to our web application.
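A minimal format check along these lines could look like the sketch below. This is not the official checker (that script lives on SMILE's GitHub page), and the "category" column name is taken from the description above.

```python
def check_dataset(rows, label_name="category"):
    """Sanity-check a parsed SMILE input table (illustrative sketch).

    rows: the parsed .csv, e.g. list(csv.reader(f)), where rows are
    samples, columns are metabolite features, and one column (assumed to
    be named 'category') holds the class labels.
    Returns (number of samples, number of metabolite features).
    """
    header, samples = rows[0], rows[1:]
    if label_name not in header:
        raise ValueError(f"no '{label_name}' column found in the header")
    label_col = header.index(label_name)
    for i, row in enumerate(samples, start=2):
        if len(row) != len(header):
            raise ValueError(f"row {i}: expected {len(header)} fields, "
                             f"got {len(row)}")
        for j, value in enumerate(row):
            if j != label_col:
                float(value)  # every concentration must parse as a number
    return len(samples), len(header) - 1
```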

We developed a web interface, https://smile-mib.cs.queensu.ca, for interpreting and visualizing the analysis results. First, a testing-accuracy filter allows the user to include only the best-performing models among all final models collected from the 1000 independent runs of the algorithm.

There are three modules for result interpretation and visualization. The first module is Feature Importance Analysis. Users can choose to investigate LGP models with a specified number of effective features. Features are then ranked based on their individual occurrence frequencies and shown in the “Feature Occurrence” graph. Clicking a feature of interest in this graph shows all LGP models containing that feature in the “Model Accuracy” graph. Selecting a point in this graph then shows the corresponding model in the “Detailed Model Info” panel. This allows users to investigate and interpret a selected predictive model based on its testing accuracy and the metabolite features involved.

Upon selecting the “Pairwise Co-occurrence Analysis” panel, users can see a heat map of “Feature Pairwise Co-occurrence”, which shows all the pairwise co-occurrence frequencies in the selected LGP models. Moreover, users can manually choose a pair of features to see their distributions in diseased cases and healthy controls in “Two-Feature Scatter Plot”.

The second module is Co-occurrence Network Analysis. Users can visualize a network of the most common metabolite pairs. In this graph visualization, a node is a metabolite, and an edge links two metabolites if their co-occurrence frequency is above the chosen threshold. The node size is proportional to the individual feature’s occurrence frequency. The edge width is proportional to the pairwise co-occurrence frequency, which is also labeled on each edge.

The third module of SMILE is Search a Feature, for users who want to investigate a metabolite/feature of particular interest. This module allows users to enter the name of a specific feature and shows that feature’s individual occurrence frequency together with its interacting features, ranked by their co-occurrence frequencies. In addition, SMILE provides a visualization of the synergy sub-network of this feature, including all of its directly interacting neighbours.
