Input corpus

Lucie Beranová
Marcin P. Joachimiak
Tomáš Kliegr
Gollam Rabby
Vilém Sklenák

In our work, we used two versions of CORD-19. As shown in Fig. 2 and as described and justified below, we preprocessed each version differently.

Fig. 2. Process of collecting data and data reduction

Dataset version 1. As the first version of the CORD-19 corpus, we used the release from 2020-10-12, which contains approximately 300,000 articles. Because acquiring citations and additional entity annotations from external APIs was time-consuming, we limited the number of processed documents to the first 10% of articles (30,000) in CORD-19. As a source of additional metadata, we used a second corpus, CORD-19-on-FHIR, which contains semantic annotations mostly generated using PubTator. By matching “pubmedid” identifiers from CORD-19-on-FHIR, we were able to retrieve metadata for 5,174 articles in the used subset of CORD-19. For 2,940 of these articles with non-empty abstracts, citation counts were successfully retrieved from the OpenCitations database API. The last reduction resulted from the unavailability of bibliometric data (Journal Quality Measures and Categories) for some articles. The final V1 dataset consisted of 2,223 articles with all necessary information available. All selected articles are in English.
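The successive reductions described for V1 can be read as a filtering funnel. The following is a minimal sketch of such a funnel, not the authors' actual code: the dictionary keys (`pubmed_id`, `abstract`, `journal`), the function name `reduction_funnel`, and the lookup structures passed in are all hypothetical and chosen only for illustration.

```python
def reduction_funnel(articles, annotated_ids, citation_counts, journal_metrics,
                     fraction=0.10):
    """Apply V1-style filters in order and record the surviving count per step.

    All argument names and record keys are illustrative assumptions:
      articles        -- list of dicts with 'pubmed_id', 'abstract', 'journal'
      annotated_ids   -- set of PubMed IDs present in the annotation corpus
      citation_counts -- dict mapping PubMed ID -> retrieved citation count
      journal_metrics -- dict mapping journal name -> bibliometric record
    """
    stages = []

    # Step 1: keep only the leading fraction of the corpus.
    subset = articles[: int(len(articles) * fraction)]
    stages.append(("leading fraction", len(subset)))

    # Step 2: keep articles whose PubMed ID has a match in the annotation corpus.
    subset = [a for a in subset if a.get("pubmed_id") in annotated_ids]
    stages.append(("annotation match", len(subset)))

    # Step 3: keep articles with a non-empty abstract and a retrieved citation count.
    subset = [
        a for a in subset
        if (a.get("abstract") or "").strip() and a["pubmed_id"] in citation_counts
    ]
    stages.append(("abstract and citations", len(subset)))

    # Step 4: keep articles whose journal has bibliometric data available.
    subset = [a for a in subset if a.get("journal") in journal_metrics]
    stages.append(("bibliometric data", len(subset)))

    return subset, stages
```

Recording the count after every step makes it easy to compare a rerun against the reported sizes (30,000, then 5,174, then 2,940, then 2,223).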

Dataset version 2. To investigate the effect of increasing the dataset size on accuracy, we checked whether the results based on the smaller sample are sufficiently representative. For this analysis, which was performed as part of a revision of this article, we used a newer release of the CORD-19 corpus (2021-06-22). We also used an up-to-date list of citations, retrieved from the OpenCitations database dump rather than from the OpenCitations API as in V1. This allowed a larger number of documents to be processed in a timely manner. Note that we did not perform the subsequent filtering steps used in V1. In particular, CORD-19-on-FHIR was not updated to match newer CORD-19 versions, and its use would thus excessively reduce the size of the corpus. We also removed articles published in 2021, since only very limited citation data was available for them. As a result, the V2 dataset consisted of 72,336 articles. The distribution of citations is visualized in Fig. 5.
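A rough sketch of the two V2 steps, counting incoming citations from a bulk dump and dropping articles published in 2021, might look as follows. It assumes the dump is a CSV with `citing` and `cited` DOI columns and that each article record carries a `publish_time` string starting with a four-digit year. The function names, these column and field names, and the choice to drop records with no usable year are illustrative assumptions, not details confirmed by the text.

```python
import csv
import io
from collections import Counter


def count_citations(dump_file, dois):
    """Count incoming citations per DOI from a citing/cited CSV stream."""
    wanted = {d.lower() for d in dois}
    counts = Counter()
    for row in csv.DictReader(dump_file):
        cited = row["cited"].strip().lower()
        if cited in wanted:
            counts[cited] += 1
    return counts  # DOIs never cited are absent and read as 0


def drop_recent(articles, cutoff_year=2021):
    """Keep articles published before cutoff_year.

    Assumption: records with a missing or malformed year are dropped too.
    """
    kept = []
    for article in articles:
        year = (article.get("publish_time") or "")[:4]
        if year.isdigit() and int(year) < cutoff_year:
            kept.append(article)
    return kept


# Usage with a tiny in-memory stand-in for the dump (made-up DOIs).
sample_dump = io.StringIO("citing,cited\n10.1/a,10.1/x\n10.1/b,10.1/x\n")
sample_counts = count_citations(sample_dump, ["10.1/x", "10.1/y"])
```

Streaming the dump row by row keeps memory use proportional to the number of DOIs of interest rather than to the size of the dump, which is one reason a local dump scales better than per-article API calls.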

Fig. 5. Distribution of highly vs lowly cited articles before the normalization by age (left) and after normalization (right) for the V2 dataset
