To assess the structure and functional features of PPIO, we first applied it to capture PPI annotations from the literature: extracted PPIs in an open standard corpus were annotated with PPIO terms, and the performance of this annotation was evaluated. We then employed PPIO to navigate PPI information.
Annotating PPIs based on PPIO. To annotate extracted PPIs, we proposed a PPIO-based approach that identifies PPIO terms occurring in the same sentence as the target PPI and assigns them to that PPI. The co-occurrence of a PPI and a PPIO term in one sentence is taken to suggest that the term represents one type of annotation of the PPI.
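The co-occurrence rule can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Sentence` record, the function name `cooccurring_annotations`, the three example terms and the made-up sentence are hypothetical and are not taken from PPIO or the corpus.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """A sentence together with the PPIs already extracted from it (hypothetical record)."""
    text: str
    ppis: list[tuple[str, str]]  # pairs of interacting protein names


def cooccurring_annotations(sentence, terms):
    """Assign to every PPI in the sentence each dictionary term found in that sentence."""
    lowered = sentence.text.lower()
    # Plain case-insensitive substring test; an assumption, not the authors' matcher.
    found = sorted(t for t in terms if t.lower() in lowered)
    return {ppi: found for ppi in sentence.ppis}


# Hypothetical stand-ins for PPIO terms and for a corpus sentence.
terms = {"phosphorylation", "nucleus", "yeast two-hybrid"}
sentence = Sentence(
    text="ProtA binds ProtB in the nucleus, as shown by yeast two-hybrid.",
    ppis=[("ProtA", "ProtB")],
)
result = cooccurring_annotations(sentence, terms)
print(result)  # {('ProtA', 'ProtB'): ['nucleus', 'yeast two-hybrid']}
```

In this sketch every term found in a sentence is attached to every PPI in that sentence, which is the simplest reading of sentence-level co-occurrence.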
Corpus and preprocessing. A corpus named “BioCreAtIvE-PPI” [26] (see Table S3 in Additional file 3) was used to evaluate the efficacy of PPIO-based annotation extraction. This dataset originated from the BioCreAtIvE Task [27] corpus. A total of 173 sentences, which contained 255 interactions, were randomly selected from the BioCreAtIvE corpus by the original PPI curator. For these sentences, each of which contained at least one PPI, additional annotations covering six aspects of PPIs were curated manually by individual annotators according to the PPIO schema. In total, 71 roles/statuses of interactors, 91 biological processes (BPs), 17 subcellular locations (SCLs), 274 interaction types (ITs), 53 biological functions (BFs) and 43 detection methods (DMs) of PPIs were labeled on the original “BioCreAtIvE-PPI” corpus. This newly curated corpus (see Table S4 in Additional file 4) was then used in the evaluation procedure. To create the reference corpus, the annotators were asked to keep in mind the breadth and depth of PPIO and to consider for annotation not only the superclass concepts but also their corresponding subclass concepts and synonyms.
Assigning annotations to related PPIs based on PPIO. We used the terms of PPIO as a dictionary for PPI annotation extraction and proposed a three-step PPIO-based approach to accomplish the annotation task. First, a string matching algorithm was applied to recognize, case-insensitively, all names and synonyms of PPIO terms in sentences containing PPIs. Second, where multiple terms matched, the longest match was selected. For instance, when the terms “regulation” and “regulation of transcription” were both identified, “regulation of transcription” was selected. Finally, the results were validated manually, and the performance of the PPIO-based approach was evaluated against the curated corpus described above. The evaluation compared the automatically annotated corpus with the manually curated corpus. Three commonly used measures, i.e., precision, recall and F-score, were used to assess the performance of the PPI annotation extraction:
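Assuming the standard definitions, with the F-score taken as the balanced harmonic mean of precision and recall (the weighting is not stated here, so this is an assumption), the three measures are:

$$
\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}
$$

$$
\text{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}
$$

$$
\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$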
where true positive is the number of entities found by the PPIO-based text mining system that matched annotations in the manually curated corpus; false positive is the number of entities that were automatically assigned by the PPIO-based text mining system but could not be matched to any annotation in the manually curated corpus; and false negative is the number of annotations in the manually curated corpus that were not found by the PPIO-based approach. Higher precision, recall and F-score indicate better performance. Further details of the evaluation material and methods are provided in Additional file 13.
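The matching and evaluation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `longest_matches` and `precision_recall_f`, the use of word boundaries in the pattern, the balanced (F1) form of the F-score, and all example terms and annotations are hypothetical.

```python
import re


def longest_matches(sentence, terms):
    """Case-insensitive dictionary matching that keeps the longest of overlapping matches."""
    spans = []
    for term in terms:
        # Word boundaries are an assumption; the text only says "string matching".
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), term))
    # Visit the longest spans first so that a shorter, nested match is rejected.
    spans.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, term in spans:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, term))
    # Report the surviving terms in order of appearance.
    return [term for _, _, term in sorted(kept)]


def precision_recall_f(predicted, gold):
    """Precision, recall and balanced F-score (F1 assumed) over two sets of annotations."""
    tp = len(predicted & gold)   # found and present in the curated set
    fp = len(predicted - gold)   # found but absent from the curated set
    fn = len(gold - predicted)   # curated but not found
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Hypothetical dictionary entries and sentence.
terms = ["regulation", "regulation of transcription", "nucleus"]
text = "ProtA binds ProtB during Regulation of Transcription in the nucleus."
matches = longest_matches(text, terms)
print(matches)  # ['regulation of transcription', 'nucleus']

# Hypothetical (sentence id, term) annotations: automatic versus curated.
predicted = {(0, "regulation of transcription"), (0, "nucleus"), (1, "binding")}
gold = {(0, "regulation of transcription"), (1, "binding"), (1, "cytoplasm")}
precision, recall, f_score = precision_recall_f(predicted, gold)
```

Sorting candidate spans by decreasing length before filtering is one simple way to realize the longest-match rule; with the toy sets above, two of three predictions are correct and two of three curated annotations are recovered, so all three measures equal 2/3.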