2.1. Benchmark Datasets

Dan Zhang; Hua-Dong Chen; Hasan Zulfiqar; Shi-Shi Yuan; Qin-Lai Huang; Zhao-Yue Zhang; Ke-Jun Deng


This method is extracted from research article: Comput Math Methods Med, Jan 2021

iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins

DOI: 10.1155/2021/6664362


Reliable data [16–18] are necessary for a robust model. The benchmark datasets constructed by Zhang et al. [15] were used in our work. They contained 17,403 BLPs from three species, namely, bacteria, eukaryote, and archaea, which were collected from UniProt (Jul. 2016). Accordingly, four benchmark datasets were generated: one general dataset and three species-specific datasets (bacteria, eukaryote, and archaea). To avoid homology bias and remove redundant sequences from the benchmark datasets, BLASTClust [19] was used to cluster all of these protein sequences with a sequence identity cutoff of 30%. One protein was then randomly picked from each cluster as its representative. In this way, 863 BLPs were obtained as positive samples, of which 748 belong to bacteria, 70 to eukaryote, and 45 to archaea. Additionally, 7,093 nonredundant non-BLPs were collected as negative samples, consisting of 4,919, 1,426, and 748 proteins of bacteria, eukaryote, and archaea, respectively. To construct a balanced training dataset, 80% of the positive samples and an equal number of negative samples were randomly selected to train the model. The remaining positive and negative samples were used for independent testing. The final four benchmark datasets are summarized in Table 1. All data are available at http://lin-group.cn/server/iBLP/download.html.
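The two random-selection steps described above (one representative per cluster, then a balanced 80% training split) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the fixed seed, and the use of rounding down when taking 80% of the positives are all assumptions, and the clusters are assumed to be already available as lists of sequence identifiers (for example, parsed from BLASTClust output). The per-dataset sizes in Table 1 of the source article remain the authoritative values.

```python
import random


def pick_representatives(clusters, seed=0):
    """Pick one sequence ID at random from each cluster.

    `clusters` is a list of non-empty lists of sequence IDs
    (hypothetical input; the clustering itself is not done here).
    """
    rng = random.Random(seed)
    return [rng.choice(cluster) for cluster in clusters]


def balanced_split(positives, negatives, train_fraction=0.8, seed=0):
    """Split samples into a balanced training set and an independent test set.

    The training set holds `train_fraction` of the positives (rounded down,
    which is an assumption) and the same number of randomly chosen negatives.
    Every remaining sample goes to the test set.
    """
    rng = random.Random(seed)
    pos = list(positives)
    neg = list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_train = int(train_fraction * len(pos))
    if n_train > len(neg):
        raise ValueError("not enough negative samples for a balanced training set")

    return {
        "train_pos": pos[:n_train],
        "train_neg": neg[:n_train],
        "test_pos": pos[n_train:],
        "test_neg": neg[n_train:],
    }


if __name__ == "__main__":
    # Placeholder IDs sized like the general dataset (863 BLPs, 7,093 non-BLPs).
    blps = [f"BLP_{i}" for i in range(863)]
    non_blps = [f"nonBLP_{i}" for i in range(7093)]
    split = balanced_split(blps, non_blps)
    for name, samples in split.items():
        print(name, len(samples))
```

Because only the training set is balanced, the independent test set in this sketch stays heavily skewed toward negative samples, which is worth keeping in mind when choosing evaluation metrics.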

Table 1. The constructed benchmark datasets for BLP prediction.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
