TF-IDF vectorization is a method that adjusts for how commonly a term is found in note text by assigning each term a calculated numerical weight. Term frequency (TF) is the number of times a term appears in the corpus for an individual patient divided by the total number of terms in that corpus, a measure of how much was written about the patient (20). In NLP, term frequency is usually multiplied by inverse document frequency (IDF). IDF reflects how rare a particular term is across all patients and is calculated in this case as the log of the number of patients (since each patient has a single document, their corpus) divided by the number of patients with the term. Therefore, if every patient has the term (an example in the ICU might be “IV,” since nearly all ICU patients have an IV catheter), the ratio of the number of patients to the number of patients with the term is near 1. Because the log of 1 is 0, the term “IV” would be zeroed out and would not contribute as a predictor in the statistical model. Rare terms that are present in few documents have a larger IDF (21). Multiplying TF by IDF yields the numerical weight for a particular term. The mathematical equations used to calculate TF-IDF are shown in Supplemental Figure 1 (http://links.lww.com/CCX/A655). The numerical weight for each term is then incorporated into statistical models. Once text has been transformed into these numeric outputs, it is referred to as “featurized data.”
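To make the computation concrete, the sketch below implements the weighting scheme described above in Python, treating each patient’s concatenated notes as a single document. The example corpus, the whitespace tokenization, and all variable names are illustrative assumptions for this sketch, not taken from the study data.

```python
# Minimal sketch of per-patient TF-IDF as described above.
# The patient documents below are hypothetical, for illustration only.
import math
from collections import Counter

# Each patient's notes concatenated into a single tokenized "document."
patient_docs = {
    "patient_1": "IV fluids started sepsis suspected IV antibiotics".split(),
    "patient_2": "IV catheter placed ventilator weaning".split(),
    "patient_3": "IV access obtained delirium screening positive".split(),
}

n_patients = len(patient_docs)

# Document frequency: number of patients whose document contains each term.
doc_freq = Counter()
for tokens in patient_docs.values():
    doc_freq.update(set(tokens))

def tf_idf(term, tokens):
    # TF: occurrences of the term in this patient's corpus divided by
    # the total number of terms written about the patient.
    tf = tokens.count(term) / len(tokens)
    # IDF: log of (number of patients / number of patients with the term).
    idf = math.log10(n_patients / doc_freq[term])
    return tf * idf

# "IV" appears in every patient's document, so IDF = log(3/3) = 0 and the
# term is zeroed out; a rarer term such as "sepsis" receives nonzero weight.
for term in ("IV", "sepsis"):
    for patient, tokens in patient_docs.items():
        print(patient, term, round(tf_idf(term, tokens), 4))
```

Running the sketch shows the behavior described in the text: every TF-IDF weight for “IV” is 0, while “sepsis” is weighted only for the one patient whose notes contain it.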