To train the CRF model, we select four types of features: BOC, POS tags, character types (CT), and the position of the character in the sentence (POCIS). Word segmentation is performed with the NLPIR Chinese word segmentation system (Institute of Computing Technology, Chinese Lexical Analysis System, ICTCLAS-2016) [30], which generates POS tags at the same time. Because we use character-level rather than word-level information, each Chinese character is assigned the POS tag of the word that contains it. In addition, we manually classify all characters in the EHR dataset into five CTs (W: common character; D: number; L: letter; S: ending punctuation; P: symmetrical punctuation). To validate that the bidirectional LSTM-CRF model can identify clinical entities with much less feature engineering than the CRF model, different combinations of these features are fed into the CRF model.
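The character-level features described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `char_type` and `char_features`, the punctuation sets, the fallback to W for any unlisted character, the 0-based POCIS index, and the POS tags in the example are all assumptions made for illustration. In the paper the character types were assigned manually, whereas the sketch approximates them with simple rules.

```python
# Hypothetical punctuation sets; the paper's CT labels were assigned manually.
ENDING_PUNCT = set("。！？；!?;")
SYMMETRIC_PUNCT = set("()（）[]【】《》“”‘’\"'")


def char_type(ch):
    """Approximate the five character types (W, D, L, S, P) with simple rules."""
    if ch.isdigit():
        return "D"  # numbers
    if ch.isascii() and ch.isalpha():
        return "L"  # Latin letters
    if ch in ENDING_PUNCT:
        return "S"  # ending punctuation
    if ch in SYMMETRIC_PUNCT:
        return "P"  # symmetrical (paired) punctuation
    return "W"      # assumption: everything else counts as a common character


def char_features(tagged_words):
    """Expand (word, POS) pairs for one sentence into per-character features.

    Each character inherits the POS tag of the word that contains it, and
    POCIS is taken here to be the 0-based index within the sentence.
    """
    features = []
    position = 0
    for word, pos in tagged_words:
        for ch in word:
            features.append(
                {"char": ch, "pos": pos, "ct": char_type(ch), "pocis": position}
            )
            position += 1
    return features


# Example input in the shape a segmenter with POS tagging might return;
# the tags "n", "m", "x", "w" are placeholders, not ICTCLAS output.
example = char_features([("肿块", "n"), ("2", "m"), ("cm", "x"), ("。", "w")])
```

In this sketch both characters of 肿块 receive the tag of the whole word, which mirrors the tag propagation described in the text.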
To train the bidirectional LSTM-CRF model, we use character embeddings and segmentation information as features. The character embeddings are learned with Google’s word2vec [31] on the unlabeled dataset of 2605 patients. The segmentation information is generated by the Jieba segmentation system [32].
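The text does not state how the segmentation information is encoded per character. One common choice, assumed here purely for illustration, is a B/M/E/S scheme (begin, middle, end, single-character word). The sketch below takes an already segmented sentence as a list of words, such as the list returned by `jieba.lcut(sentence)`, and the function name `segmentation_tags` is hypothetical.

```python
def segmentation_tags(words):
    """Map a segmented sentence to one B/M/E/S tag per character.

    The B/M/E/S encoding is an assumption; the source only says that
    segmentation information is used as a feature.
    """
    tags = []
    for word in words:
        if not word:
            continue  # skip empty tokens defensively
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # First character, any interior characters, last character.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Words as a segmenter might return them for a short clinical phrase.
example_tags = segmentation_tags(["乳腺", "浸润性", "癌"])
```

Each tag can then be mapped to a small embedding or one-hot vector and concatenated with the character embedding before the LSTM layer, which is again one possible design rather than the documented one.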