Model Development

Sangwoo Seo, Youngmin Kim, Hyo-Jeong Han, Woo Chan Son, Zhen-Yu Hong, Insuk Sohn, Jooyong Shim, Changha Hwang

The problem of predicting successes and failures of clinical trials is modeled as a binary classification task. For a given drug $i$, the target label is a binary variable $y_i$, where $y_i = 1$ indicates that the drug passed clinical trials and $y_i = 0$ indicates otherwise. Our dataset contains $n = 828$ drugs, each represented by a pair of feature vector $x_i$ and corresponding clinical outcome $y_i$: $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i = (x_i^{(1)}, x_i^{(2)})$, and $x_i^{(1)}$ and $x_i^{(2)}$ denote the chemical feature vector and the target-based feature vector, respectively. The data associated with this task are bimodal and highly imbalanced: the two modalities describe the chemical properties and the target-based properties of the drugs, respectively. Thus, we need to effectively join the two modalities, and we also need a model that deals with the class-imbalance problem.
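For concreteness, here is a minimal NumPy sketch of this data layout; the variable names and random placeholder values are ours, not the authors':

```python
import numpy as np

n = 828                                  # number of drugs
x_chem = np.random.rand(n, 13)           # x^(1): chemical feature vectors
x_target = np.random.rand(n, 34)         # x^(2): target-based feature vectors
y = np.random.binomial(1, 0.7, size=n)   # y_i = 1 if the drug passed, else 0

# Each sample pairs the two modalities with its clinical outcome.
dataset = [((x_chem[i], x_target[i]), y[i]) for i in range(n)]
```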

Figure 1 shows the entire workflow of the proposed OPCNN classifier for predicting successes and failures of clinical trials. Our OPCNN consists of three residual blocks and five fully connected (FC) layers. Each residual block has three convolution layers, each of which employs 32 kernels with kernel size 3 and stride 1, followed by the rectified linear unit (ReLU) activation function. The numbers in parentheses in FC(1), FC(50), and FC(100) indicate the number of nodes. The FC(1) layer employs the sigmoid activation function, whereas the FC(50) and FC(100) layers employ the ReLU activation function. Our method consists of two stages. First, the representative feature vectors of the chemical feature vector and the target-based feature vector are computed, and the outer product between these two representative feature vectors is calculated. Second, a 2D CNN model is adopted to extract deep features from the outer product and to predict successes and failures of clinical trials.

Figure 1. Workflow of the proposed OPCNN classifier for predicting successes and failures of clinical trials. Given the outer product of two representative feature vectors as input, a 2D CNN is used to learn features. The architecture of OPCNN consists of three residual blocks and five fully connected (FC) layers; each residual block has three convolution layers. (A) OPCNN classifier. (B) Residual block.
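As an illustration, the following is a rough PyTorch sketch of the building blocks just described. The convolution settings (32 kernels, kernel size 3, stride 1, ReLU) follow the text, while the padding, the 1x1 shortcut convolution, and the exact wiring of the FC head are our assumptions, not the authors' verified configuration:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Three 3x3 convolutions (32 kernels, stride 1) with a skip connection."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        )
        # 1x1 convolution so the skip connection matches the channel count
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

class OPCNN(nn.Module):
    """Three residual blocks followed by an FC(100)-FC(50)-FC(1) head."""
    def __init__(self, grid=51):  # 51 x 51: outer product of augmented vectors
        super().__init__()
        self.blocks = nn.Sequential(
            ResidualBlock(1), ResidualBlock(32), ResidualBlock(32)
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * grid * grid, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1), nn.Sigmoid(),  # probability that the drug passes
        )

    def forward(self, z):  # z: (batch, 1, 51, 51) outer-product tensor
        return self.head(self.blocks(z))
```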

The process of calculating the outer product is as follows. The chemical feature vector $x^{(1)} \in \mathbb{R}^{13}$ and the target-based feature vector $x^{(2)} \in \mathbb{R}^{34}$ from the two modalities are first fed into separate FC(50) layers to obtain representative feature vectors $f^{(1)} \in \mathbb{R}^{50}$ and $f^{(2)} \in \mathbb{R}^{50}$ and to improve performance. Given $f^{(1)}$ and $f^{(2)}$, the outer product on the augmented unimodal vectors is calculated as follows:

$$Z = \begin{bmatrix} f^{(1)} \\ 1 \end{bmatrix} \otimes \begin{bmatrix} f^{(2)} \\ 1 \end{bmatrix} \qquad (1)$$

Here, $\otimes$ indicates the outer product between vectors. Thus, this outer product produces two sets of information: the bimodal interactions in the form of a two-dimensional tensor and the raw unimodal representations of the modalities. The tensor calculated by the outer product is directly fed into the first residual block. The final representation is used for the classification task.
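A minimal NumPy sketch of Eq. 1 (variable names are ours):

```python
import numpy as np

f1 = np.random.rand(50)          # representative chemical features
f2 = np.random.rand(50)          # representative target-based features

# Augment each vector with a constant 1 so the result also keeps the raw
# unimodal representations along the last row and column.
z = np.outer(np.append(f1, 1.0), np.append(f2, 1.0))   # shape (51, 51)

# z[:50, :50] holds the bimodal interactions, z[:50, 50] == f1,
# z[50, :50] == f2, and z[50, 50] == 1.
```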

Classification with multimodal data arises in many machine learning applications (Baltrušaitis et al., 2019; Gao et al., 2020). Multimodal learning is an effective approach for combining information from multiple modalities to perform a prediction task; the modalities may be independent or correlated. Fusing multiple modalities is a key issue in any multimodal task. In general, fusion can be performed at three levels: at the level of features (or at a lower layer), at an intermediate layer, and at the level of decisions. Fusion at the feature level or at a lower layer is called early fusion, fusion at an intermediate layer is called intermediate fusion, and fusion at the level of decisions is called late fusion. Because early and late fusion generally suppress either intra-modality or inter-modality interactions, recent studies have focused on intermediate fusion methods that allow fusion to occur across multiple layers of a deep model.

Figure 2 illustrates the deep multimodal neural network (DMNN) models associated with the early, intermediate, and late fusions used in this study. As seen in Figure 2, each DMNN model consists of several FC layers; the number in parentheses indicates the number of nodes. As in Figure 1, the FC(1) layer employs the sigmoid activation function, and the FC(50) and FC(100) layers employ the ReLU activation function. In the case of early fusion, each modality is first fed into an FC(50) layer before fusion, both to improve performance and to allow several fusion techniques to be applied; standard early fusion, by contrast, directly concatenates the modalities into a single multimodal vector. In the case of intermediate and late fusion, each modality is fed into an independent deep neural network (DNN), and the outputs are then fused to form the inputs of higher layers. The final representation is used for the classification task.

Figure 2. Graphical representation of the early, intermediate, and late fusions. (A) Early fusion. (B) Intermediate fusion. (C) Late fusion.
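As a sketch of the intermediate-fusion variant in Figure 2B, assuming concatenation as the fusion operation and the FC(100)-FC(100)-FC(50) per-modality stacks mentioned below (the exact wiring is our reading of the figure, not a verified implementation):

```python
import torch
import torch.nn as nn

class IntermediateFusionDMNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Independent FC(100)-FC(100)-FC(50) stack per modality
        self.chem = nn.Sequential(
            nn.Linear(13, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
        )
        self.target = nn.Sequential(
            nn.Linear(34, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
        )
        # Shared layers after fusion, ending in a sigmoid FC(1)
        self.shared = nn.Sequential(
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1), nn.Sigmoid(),
        )

    def forward(self, x1, x2):
        fused = torch.cat([self.chem(x1), self.target(x2)], dim=-1)
        return self.shared(fused)

# Example: pass probabilities for a batch of four drugs
probs = IntermediateFusionDMNN()(torch.rand(4, 13), torch.rand(4, 34))
```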

Based on the literature, five fusion operations are commonly used to fuse multiple modalities (Feng et al., 2021): 1) addition, 2) product, 3) concatenation, 4) ensemble, and 5) mixture of experts. The addition and product operations are performed element-wise at the fusion layer. Here, we consider two more multimodal fusion techniques based on the tensor fusion layer (TFL) (Zadeh et al., 2017) and multimodal circulant fusion (MCF) (Wu and Han, 2018) for early and intermediate fusion. When using TFL and MCF for intermediate fusion, we actually use the DMNN model with FC(100)-FC(50) instead of FC(100)-FC(100)-FC(50) for each modality to improve performance.
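The five operations can be sketched as follows for two aligned 50-dimensional representations; the ensemble and mixture-of-experts lines are schematic stand-ins (real versions would combine trained classifiers and a learned gating network):

```python
import torch

f1, f2 = torch.rand(50), torch.rand(50)

fused_add = f1 + f2                      # 1) element-wise addition
fused_mul = f1 * f2                      # 2) element-wise product
fused_cat = torch.cat([f1, f2])          # 3) concatenation

# 4) ensemble: weight and combine per-modality classifier decisions
p1, p2 = torch.sigmoid(f1.sum()), torch.sigmoid(f2.sum())  # stand-in scores
fused_ens = 0.5 * p1 + 0.5 * p2

# 5) mixture of experts: a gate decides how much to trust each modality
gate = torch.sigmoid(torch.rand(1))      # would normally be learned
fused_moe = gate * f1 + (1 - gate) * f2
```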

In general, the early fusion approach performs better than individual unimodal classifiers. The ensemble approach, called late fusion, weights several individual classifiers and combines them to obtain a classifier that surpasses the individual ones. Ensemble methods generally provide better results when there are significant differences among the models; therefore, many ensemble methods try to enhance diversity among the models to be combined. Based on our preliminary studies, unimodal classifiers using only chemical features perform better than unimodal classifiers using only target-based features. We tried three different ensemble models, using a support vector machine (SVM) (Vapnik, 1995), a one-dimensional CNN, and our DMNN, for the late fusion in Figure 2. Note that our DMNN model uses only the concatenation technique for late fusion. Since the DMNN ensemble model showed the best performance, we report only those results later.

We now briefly describe the TFL and MCF strategies. Element-wise addition and product join features from multiple modalities, and the concatenation technique focuses more on learning intra-modality dynamics than inter-modality dynamics. In contrast, both TFL and MCF capture intra-modality and inter-modality dynamics. TFL employs the same outer product on the augmented unimodal vectors as our OPCNN.

We first illustrate the idea of the TFL strategy, which fuses multimodal data at the tensor level. For our studies, we build a TFL that disentangles unimodal and bimodal dynamics. Given the representative feature vectors $f^{(1)} \in \mathbb{R}^{50}$ and $f^{(2)} \in \mathbb{R}^{50}$ associated with the chemical feature vector $x^{(1)} \in \mathbb{R}^{13}$ and the target-based feature vector $x^{(2)} \in \mathbb{R}^{34}$, TFL calculates the outer product on the augmented unimodal vectors using Eq. 1. However, as seen in Figure 2, $f^{(1)}$ and $f^{(2)}$ are obtained slightly differently for early fusion and intermediate fusion. Thus, TFL also produces two sets of information: the bimodal interactions in the form of a two-dimensional tensor and the raw unimodal representations of the modalities. The tensor calculated by TFL is flattened and fed into an FC layer. Note that TFL introduces no learnable parameters, and although it yields a high-dimensional output tensor, the chance of overfitting is low (Zadeh et al., 2017).
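A small sketch of how a TFL output could feed the subsequent FC layer; the use of torch.outer and the layer sizes are our choices for illustration:

```python
import torch
import torch.nn as nn

f1, f2 = torch.rand(50), torch.rand(50)
one = torch.ones(1)

# Eq. 1: outer product of the augmented vectors -> (51, 51) tensor
z = torch.outer(torch.cat([f1, one]), torch.cat([f2, one]))

# Flatten and pass to an FC layer; only these layers hold parameters,
# since TFL itself introduces none.
fc = nn.Sequential(nn.Linear(51 * 51, 50), nn.ReLU())
h = fc(z.flatten())   # fused 50-dimensional representation
```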

We now briefly illustrate the idea of the MCF strategy, which consists of four steps. Given the representative feature vectors $f^{(1)} \in \mathbb{R}^{50}$ and $f^{(2)} \in \mathbb{R}^{50}$, we first project $f^{(1)}$ and $f^{(2)}$ to a lower-dimensional space using projection matrices $W_1 \in \mathbb{R}^{d \times 50}$ and $W_2 \in \mathbb{R}^{d \times 50}$:

$$v = W_1 f^{(1)}, \qquad c = W_2 f^{(2)}, \qquad (2)$$

where $d < 50$. As in TFL, $f^{(1)}$ and $f^{(2)}$ are obtained slightly differently for early fusion and intermediate fusion. Second, we construct circulant matrices $A \in \mathbb{R}^{d \times d}$ and $B \in \mathbb{R}^{d \times d}$ from the projection vectors $v \in \mathbb{R}^{d}$ and $c \in \mathbb{R}^{d}$:

$$A = \mathrm{circ}(v), \qquad B = \mathrm{circ}(c), \qquad (3)$$

where $\mathrm{circ}(b)$ denotes converting a vector $b$ to a circulant matrix. Third, we calculate the interaction vectors $f$ and $g$ in one of two ways, multiplying each circulant matrix with the other projection vector so that the elements of the matrix and the vector fully interact. The two ways are illustrated in Eqs. 4 and 5:

$$f = \frac{1}{d} \sum_{i=1}^{d} a_i \odot c, \qquad g = \frac{1}{d} \sum_{i=1}^{d} b_i \odot v, \qquad (4)$$

$$f = A c, \qquad g = B v. \qquad (5)$$

Here, $a_i$ and $b_i$ are the column vectors of the circulant matrices $A$ and $B$, respectively, and $\odot$ denotes the element-wise product. Note that we introduce no new parameters in the multiplication operation. Finally, we calculate the target vector $m \in \mathbb{R}^{k}$ using $f$, $g$, and a projection matrix $W_3 \in \mathbb{R}^{k \times d}$:

$$m = W_3 (f \oplus g), \qquad (6)$$

where $\oplus$ denotes element-wise addition.
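Putting the four MCF steps together, a NumPy sketch; the value of $d$, the random projection matrices, and the choice of Eq. 4 for the third step are ours:

```python
import numpy as np
from scipy.linalg import circulant

d, k = 25, 50
f1, f2 = np.random.rand(50), np.random.rand(50)

# Step 1: project both representations to a lower-dimensional space (Eq. 2).
W1, W2 = np.random.rand(d, 50), np.random.rand(d, 50)
v, c = W1 @ f1, W2 @ f2

# Step 2: build circulant matrices from the projection vectors (Eq. 3).
A, B = circulant(v), circulant(c)

# Step 3: let matrix and vector elements fully interact (Eq. 4):
# sum_i a_i (x) c equals A.sum(axis=1) * c, averaged over the d columns.
f = (A.sum(axis=1) * c) / d
g = (B.sum(axis=1) * v) / d
# The Eq. 5 alternative is plain matrix multiplication: f, g = A @ c, B @ v

# Step 4: map the fused vectors to the target representation (Eq. 6).
W3 = np.random.rand(k, d)
m = W3 @ (f + g)   # element-wise addition, then projection
```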

Since the ratio of passed drugs to failed drugs in clinical trials is highly imbalanced, the class-imbalance problem occurs. There are generally three types of methods for imbalanced data learning (Wang et al., 2019). We briefly describe the methods actually used in the study; a short code sketch follows the list.

1) Sampling method: an intuitive way to cope with an imbalanced data distribution is to balance the class distributions via resampling, which may oversample the minority class or undersample the majority class. One advanced sampling method, the synthetic minority oversampling technique (SMOTE), creates artificial examples by interpolating neighboring data points (Chawla et al., 2002), and several variants of this technique have been proposed. However, oversampling can lead to overfitting because the existing minority samples are visited repeatedly, whereas undersampling can discard potentially useful information in the majority samples.

2) Cost-sensitive learning method: instead of balancing the class distributions via sampling, this method copes with the above issues by directly imposing a heavier cost on misclassifying the minority class. However, what type of cost to use in different problem settings remains an open problem. In this study, we use cost-sensitive learning with the class weights (CWs) $n/(2 \times n_+)$ and $n/(2 \times n_-)$ for the positive and negative classes, respectively. Recall that the majority class is the positive class and the minority class is the negative class in this study. Here, $n$ is the size of the training dataset, and $n_+$ and $n_-$ are the sizes of the positive and negative classes, respectively.

3) Hybrid method: this approach combines the two methods above. In the study, we use the combination of the SMOTE and CW techniques.
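A sketch of the class-weight computation and SMOTE resampling described above, using the imbalanced-learn package; the toy data stands in for the real feature matrix:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(828, 47)                # 13 chemical + 34 target features
y = np.random.binomial(1, 0.7, size=828)   # imbalanced labels (1 = passed)

# Cost-sensitive learning: weights n/(2*n_plus) and n/(2*n_minus), so the
# minority (failed) class is penalized more heavily when misclassified.
n, n_plus, n_minus = len(y), (y == 1).sum(), (y == 0).sum()
class_weight = {1: n / (2 * n_plus), 0: n / (2 * n_minus)}

# Sampling: SMOTE synthesizes minority examples by interpolating neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# Hybrid method: apply SMOTE first, then train with the class weights
# (e.g., pass class_weight to the training routine of the classifier).
```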
