Text preprocessing

Zhan Ye; Ahmad P. Tafti; Karen Y. He; Kai Wang; Max M. He


Concise Method


This method is extracted from research article: PLoS One, Sep 2016

SparkText: Biomedical Text Mining on Big Data Framework

DOI: 10.1371/journal.pone.0162721


The first step of SparkText was text preprocessing, in which we applied several preprocessing tasks to the raw text data (abstracts or full-text articles). This stage comprised a number of optional tasks: (a) replacing special symbols and punctuation marks with blank spaces; (b) case normalization; (c) removing duplicate characters, rare words, and user-defined stop-words; and (d) word stemming. To this end, we first parsed abstracts or full-text articles that had been pre-categorized by MeSH terms (e.g., breast cancer, lung cancer, and prostate cancer) into sentences. We then replaced special characters, such as quotation marks and other punctuation marks, with blank spaces and converted all sentences to lowercase to provide normalized statements. Afterwards, we split the sentences into individual words (tokens). Rare words and user-defined stop-words were removed, and the Porter Stemmer algorithm [25,26] was then used to stem the remaining words.
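The steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SparkText implementation: the function names (`split_sentences`, `normalize`, `naive_stem`, `preprocess`), the regex-based sentence splitter, and the `min_count` threshold for rare words are all hypothetical, and the source does not state how rare words were defined. The default `naive_stem` is a simple suffix stripper standing in for the Porter Stemmer so that the sketch runs with the standard library alone. Duplicate-character removal is omitted because the source does not specify it.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def normalize(sentence):
    # Replace every character that is not a letter, digit, or whitespace
    # with a blank space, then convert to lowercase.
    return re.sub(r"[^A-Za-z0-9\s]", " ", sentence).lower()


def naive_stem(word):
    # Placeholder only: this is NOT the Porter algorithm. It strips one common
    # suffix as long as at least three characters remain.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(documents, stop_words=frozenset(), min_count=1, stem=naive_stem):
    # 1. Sentences -> normalized text -> tokens, per document.
    tokenized = []
    for doc in documents:
        tokens = []
        for sentence in split_sentences(doc):
            tokens.extend(normalize(sentence).split())
        tokenized.append(tokens)

    # 2. Corpus-wide token counts, used to decide which words are "rare"
    #    (assumption: rare means fewer than min_count occurrences overall).
    counts = Counter(t for tokens in tokenized for t in tokens)

    # 3. Drop stop-words and rare words, then stem what remains.
    return [
        [stem(t) for t in tokens if t not in stop_words and counts[t] >= min_count]
        for tokens in tokenized
    ]


docs = [
    'The tumors were growing. Tumors "spread" quickly!',
    "The patients were treated.",
]
result = preprocess(docs, stop_words={"the", "were"})
```

To use an actual Porter stemmer, one option is to pass NLTK's implementation as the `stem` argument, for example `stem=PorterStemmer().stem` after `from nltk.stem import PorterStemmer`; this requires NLTK to be installed and its output will differ from the placeholder stemmer.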

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
