Calculate TF-IDF and cosine similarities

Margaret Mahan; Daniel Rafter; Hannah Casey; Marta Engelking; Tessneem Abdallah; Charles Truwit; Mark Oswood; Uzma Samadani

This method is extracted from research article: PLoS One, Jan 2020

tbiExtractor: A framework for extracting traumatic brain injury common data elements from radiology reports

DOI: 10.1371/journal.pone.0214775

Using the scikit-learn [27] TfidfVectorizer, the corpus was converted into a matrix of TF-IDF (term frequency times inverse document frequency) features built from n-grams of length one to ten. Cosine similarities between each pair of radiology reports were calculated by multiplying the TF-IDF matrix by its transpose; this product equals the cosine similarity when the rows are L2-normalized, which is the TfidfVectorizer default. Next, one radiology report was randomly selected, and all radiology reports with a cosine similarity of at least 0.70 to it were collected in a set. From this set, one radiology report was randomly selected to keep for further analysis and the remainder were removed. This was applied recursively for each set until every radiology report was either retained for further analysis or marked for removal. The purpose of this removal was to reduce the amount of data requiring human annotation. Details are provided in S2 Appendix.
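The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `deduplicate_reports`, its parameters, the fixed random seed, and the three example report strings are all hypothetical, and the recursive step is interpreted here as repeating the selection until every report is either kept or removed. The sketch also assumes the default L2 normalization of `TfidfVectorizer` and a corpus small enough for a dense similarity matrix.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer


def deduplicate_reports(reports, threshold=0.70, ngram_range=(1, 10), seed=0):
    """Return (kept, removed) index lists for near-duplicate reports.

    Hypothetical helper: one possible reading of the method in the text.
    """
    rng = random.Random(seed)

    # TF-IDF features over n-grams of length one to ten.
    tfidf = TfidfVectorizer(ngram_range=ngram_range).fit_transform(reports)

    # Rows are L2-normalized by default, so the matrix times its
    # transpose gives the pairwise cosine similarities.
    similarity = (tfidf @ tfidf.T).toarray()

    remaining = set(range(len(reports)))
    kept, removed = [], []
    while remaining:
        # Randomly pick a report that has not been kept or removed yet.
        anchor = rng.choice(sorted(remaining))
        # Collect it together with every remaining report at or above
        # the similarity threshold.
        group = [
            i for i in sorted(remaining)
            if i == anchor or similarity[anchor, i] >= threshold
        ]
        # Keep one random member of the set and mark the rest for removal.
        keeper = rng.choice(group)
        kept.append(keeper)
        removed.extend(i for i in group if i != keeper)
        remaining.difference_update(group)
    return sorted(kept), sorted(removed)


# Made-up example texts, for illustration only.
reports = [
    "No acute intracranial hemorrhage. No midline shift.",
    "No acute intracranial hemorrhage. No midline shift.",
    "Large left subdural hematoma with midline shift of 8 mm.",
]
kept, removed = deduplicate_reports(reports)
print("kept:", kept, "removed:", removed)
```

In this example the two identical reports fall into one set, so only one of them is kept, while the third report is dissimilar and is always retained. For a large corpus, the dense `toarray()` call may need to be replaced with row-wise processing of the sparse product.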

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
