2.2. Why Jaccard Similarity?

YC Yu Chen
WW Wei Wang
JL Jiale Liu
JF Jinping Feng
XG Xinqi Gong
request Request a Protocol
ask Ask a question
Favorite

Whether for social networks or PPI networks, the core idea of network-based link prediction methods we mentioned in section 2.1 is to design a similarity measure between nodes for their networks, which determines the likelihood of the linkage between each pair of nodes. Therefore, these similarity measures in Table 1 are directly used as indices of link prediction. There are different reasons for the selection of their similarity measure, respectively, such as Preferential Attachment (Barabâsi et al., 2002), Resource Allocation (Zhou et al., 2009), and Reciprocal Relationship (Dick and Green, 2018), etc. The index we are going to propose is still network-based, but there are two differences between our method and the previous ones in Table 1:

The similarity used in our index is Jaccard Similarity (Jaccard, 1912). In this subsection, we will explain the reasons for choosing Jaccard Similarity from two aspects.

Unlike the indices in Table 1, we do not directly use the Jaccard Similarity between node u and v to predict links between them, but use the linear combination of Jaccard Similarities between one node's neighbors and the other node. We will explain the reasons in this subsection and section 2.4.

There are two reasons for choosing Jaccard Similarity:

For PPI networks, proteins with similar structures share similar interaction interfaces (Norel et al., 1994). Therefore, the two interacting proteins have complementary interfaces to each other (see Figure 2). In other words, two proteins with similar interfaces are likely to share more interacting neighbors rather than interacting with each other. The more similar they are, the greater proportion of common neighbors. We mentioned earlier that the structural information of proteins is not easy to obtain, but we can infer the similarity between them by the proportion of their common neighbors, i.e., Jaccard Similarity (Jaccard, 1912).

The structures of six dimers, from: (RCSB PDB). Two interacting monomers are represented by different colors. Their interaction interfaces are complementary.

where Γi is the set of neighbors of node i in the PPI network.

Figure 3 shows an example of the correlation between interface similarity and Jaccard Similarity. We can see from the naked eye that Camk2d and Camk2g have similar 3D structure, which leads to their similar interaction interfaces. Therefore, they may share a large proportion of interaction neighbors who have complementary interfaces to them, respectively, i.e., Jaccard Similarity. Since not every protein in PPI network has already known 3D structure, we use the global alignment algorithm (Needleman and Wunsch, 1970) to measure the similarity between protein pairs. The alignment score of Camk2d and Camk2g is 0.83395, which means they have very similar amino acid sequences. Furthermore, we also found evidence in the database (PhylomeDB) that they are paralogues derived from gene duplication events. Figure 4 shows the mean pairwise alignment score in each interval of Jaccard Similarity. With the increase of Jaccard Similarity, the mean alignment score is also increasing. That means that protein pairs with high Jaccard Similarities are more likely to have similar amino acid sequences, structures and interaction interfaces. Therefore, the first potential reason for high Jaccard Similarity is high interface similarity, and a protein with complementary interface to them becomes their common neighbor. For example, if a protein C interacts with A which has high Jaccard Similarity with B, then C may also interact with B because the interface complementarity between C and A leads to the possibility that the interfaces of C and B are also complementary.

Camk2d, Camk2g, and their neighbors in the PPI network of Rattus norvegicus, from: (STRING). The similarity of their interaction interfaces determines that they can share a large proportion of interaction partners.

Mean alignment score in each interval of Jaccard Similarity. On the whole, the alignment score of high Jaccard Similarity node pairs is higher than that of low Jaccard Similarity node pairs.

Another reason for choosing Jaccard Similarity is from gene duplication (Zhang, 2003; Dehal and Boore, 2005). In the process of evolution, genes may produce new products: new proteins, which may retain many of the properties of the original ones and consequently preserve many interacting partners.

We find several proteins that are recorded as the products of gene duplication events from (PhylomeDB), and then generate several organisms' PPI networks containing them from (STRING). We delete links with confidence <0.7 in the PPI networks to ensure the reliability. The Jaccard Similarities and alignment scores of these protein pairs are shown in Table 2. We can see that protein pairs from gene duplication events have high Jaccard Similarities and alignment scores. Because the products of gene duplication have similar amino acid sequence, which leads to the similarity of their structures and interaction interfaces. As a result, they share a large proportion of interaction neighbors in PPI network, i.e., they have high Jaccard Similarities. Therefore, the second potential reason for a high Jaccard Similarity protein pair is that they are products of gene duplication, and a protein that interact with one of them is likely to interact with the other one. For example, if a protein C interacts with A which is a product of gene duplication to B, then C may interact with B too.

Comparison of alignment scores of protein pairs produced by gene duplication events and Jaccard Similarities of them.

PPI networks are from (STRING), and the evidence of gene duplication events are from (PhylomeDB).

To sum up, proteins with complementary interfaces interact with each other, and proteins with similar interfaces share interacting partners; the similarity of gene duplication products leads to sharing interacting neighbors. And Jaccard Similarity can reflect the interface similarity between protein pairs as well as the similarity between gene duplication products. Based on that, we propose a basic assumption: the more similar proteins are, the more likely they are to share more interacting partners, rather than interacting with each other. This is the basis for the index Sim we will propose in section 2.4. In the following subsections, in order to further verify the rationality of using Jaccard Similarity for similarity measure, we will compare the performances of our index Sim with indices using other similarity measures on two types of random networks and real PPI networks.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

post Post a Question
0 Q&A