A single disease may be described by many different expressions, so the extracted disease entities frequently contain several surface forms of the same concept. To normalise these expressions, we applied a word2vec model to encode the disease entities detected in the NER step into vector representations and mapped them to the corresponding ICD-10 disease names. We trained an unsupervised word representation model on 0.955 million diagnostic notes, 1.08 million word entries from a publicly available knowledge base and 69K sentences from medical books. Similar to previous studies that used word2vec for synonym extraction based on similarity in semantic space, we used word representations to align the surface terms detected by NER with the most semantically relevant standard ICD-10 terms. The disease mentions extracted from EMRs and the standard disease names in ICD-10 were first encoded into low-dimensional embeddings. ICD-10 terms are usually longer than the entities extracted from free text. We therefore first applied Chinese word segmentation to each ICD-10 term and then transformed the segmented words into dense vectors, which were combined by weighted averaging to derive the final embedding of the term. To improve the quality of the resulting representations, the principal component (derived by performing PCA on the word embedding matrix) was subtracted from the word embeddings to eliminate the dominating directions and the mean values of the vector space, making the embeddings more discriminative [34]. We then measured the cosine similarity between the dense vectors of the extracted entities and those of the ICD-10 terms. To define the best cut-off threshold for the similarity measure, we created ground-truth data by aligning the terms detected by NER in the previous step to ICD-10 names; 6795 of the 7584 terms could be aligned and 789 could not. We automated a testing procedure to evaluate the trained model on these ground-truth data and selected the threshold (0.68) that yielded the highest F1 score.
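The embedding steps described above (weighted averaging of segmented words, removal of the dominant principal component, cosine similarity) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation. The function names, the per-word weights (the text does not say how they were derived) and the use of a single principal component are assumptions, and the Chinese word segmenter is left out: terms are assumed to arrive already split into tokens.

```python
import numpy as np


def remove_top_component(vectors):
    """Centre the embedding matrix and remove its first principal component.

    vectors: array of shape (n_words, dim), one row per word.
    Assumption: only the single most dominant direction is removed.
    """
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions of the centred matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[0]
    # Subtract each row's projection onto the dominant direction.
    return centered - np.outer(centered @ top, top)


def embed_term(tokens, word_vectors, weights):
    """Weighted average of the vectors of a segmented term.

    tokens: list of words from an already segmented term.
    word_vectors: dict mapping word -> 1-D array.
    weights: dict mapping word -> weight (hypothetical; defaults to 1.0).
    """
    vecs = np.array([word_vectors[t] for t in tokens])
    w = np.array([weights.get(t, 1.0) for t in tokens])
    return (w[:, None] * vecs).sum(axis=0) / w.sum()


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In such a setup, `remove_top_component` would be applied once to the full word embedding matrix, after which `embed_term` builds one vector per ICD-10 term and `cosine` compares it with the vector of each extracted mention.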
The ICD-10 disease name with the highest similarity score above the threshold of 0.68 was selected as the normalisation outcome. More detailed explanations of the above two sections can be found in the Supplementary Information.
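The threshold search and the final selection rule can be illustrated with a short sketch. It is an assumption-laden example rather than the procedure actually used: the text does not define how precision and recall were counted, so here a prediction is correct when it equals the ground-truth ICD-10 name, precision is taken over all mentions that received a prediction, and recall over all mentions that have a ground-truth name. All function names are hypothetical.

```python
def normalise(scores, threshold):
    """Pick the ICD-10 name with the highest similarity, if above threshold.

    scores: dict mapping ICD-10 name -> cosine similarity for one mention.
    Returns None when even the best candidate does not exceed the threshold.
    """
    name, best = max(scores.items(), key=lambda kv: kv[1])
    return name if best > threshold else None


def f1_at(threshold, examples):
    """F1 score of the selection rule at one threshold.

    examples: list of (scores, gold) pairs, where gold is the ground-truth
    ICD-10 name or None for a mention that could not be aligned.
    """
    aligned = sum(1 for _, gold in examples if gold is not None)
    predicted = correct = 0
    for scores, gold in examples:
        pred = normalise(scores, threshold)
        if pred is not None:
            predicted += 1
            if pred == gold:
                correct += 1
    if correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / aligned
    return 2 * precision * recall / (precision + recall)


def best_threshold(examples, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at(t, examples))
```

A sweep such as `best_threshold(examples, [i / 100 for i in range(101)])` over the ground-truth pairs would then return the cut-off to use with `normalise`.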