We used n-grams as features. In clinical text, unigrams are single words and bigrams are pairs of consecutive words. For example, in the phrase “patient owns a shotgun” the unique unigrams are patient, owns, a, and shotgun, and the unique bigrams are patient_owns, owns_a, and a_shotgun. Alphabetic and numeric tokens (discrete words and numbers) were counted in the unigrams and bigrams. The features included unique unigrams with a frequency greater than 34 and unique bigrams in the annotation spans with a frequency greater than four. These thresholds were chosen empirically to filter out the less prevalent n-grams and reduce overfitting. For each document, the training features for the model consisted of binary indicators of the presence of each of the identified unigrams and bigrams, along with the offset location of the keyphrase in the snippet.
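The feature construction described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names (`tokenize`, `bigrams`, `build_vocabulary`, `featurize`) are hypothetical, lowercasing is an assumption, and "frequency" is assumed to mean the total number of occurrences across the supplied texts. The source counts bigrams within annotation spans only; here the caller is assumed to pass in whichever texts should be counted.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters or a run of digits.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+")


def tokenize(text):
    """Split text into alphabetic or numeric tokens (lowercased by assumption)."""
    return TOKEN_RE.findall(text.lower())


def bigrams(tokens):
    """Join each pair of consecutive tokens with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def build_vocabulary(texts, min_unigram=34, min_bigram=4):
    """Keep n-grams whose total count is strictly greater than the threshold."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        unigram_counts.update(tokens)
        bigram_counts.update(bigrams(tokens))
    vocab = sorted(u for u, c in unigram_counts.items() if c > min_unigram)
    vocab += sorted(b for b, c in bigram_counts.items() if c > min_bigram)
    return vocab


def featurize(text, vocab, keyphrase_offset):
    """Binary presence indicator per vocabulary entry, plus the keyphrase offset."""
    tokens = tokenize(text)
    present = set(tokens) | set(bigrams(tokens))
    return [1 if term in present else 0 for term in vocab] + [keyphrase_offset]
```

With the thresholds from the text, `build_vocabulary(texts)` would retain unigrams seen more than 34 times and bigrams seen more than four times; smaller values can be passed for experimentation on toy data.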
