2.2.5. Word2vec method-based ICD-10 synonym identification

Xiaowen Ruan; Yue Li; Xiaohui Jin; Pan Deng; Jiaying Xu; Na Li; Xian Li; Yuqi Liu; Yiyi Hu; Jingwen Xie; Yingnan Wu; Dongyan Long; Wen He; Dongsheng Yuan; Yifei Guo; Heng Li; He Huang; Shan Yang; Mei Han; Bojin Zhuang; Jiang Qian; Zhenjie Cao; Xuying Zhang; Jing Xiao; Liang Xu

This method is extracted from research article: Lancet Reg Health West Pac, Mar 2021

Health-adjusted life expectancy (HALE) in Chongqing, China, 2017: An artificial intelligence and big data method estimating the burden of disease at city level

DOI: 10.1016/j.lanwpc.2021.100110

A single disease may be described with different expressions, so synonymy occurs frequently among the extracted disease entities. To normalise these disease expressions, we applied a word2vec model to encode the disease entities detected in the NER step into vector representations and mapped them to the corresponding ICD-10 disease names. We trained an unsupervised word representation model on 0.955 million diagnostic notes, 1.08 million word entries from a publicly available knowledge base and 69K sentences from medical books. Similar to previous studies that used word2vec for synonym extraction based on similarity in semantic space, we employed word representations to align the surface terms detected by NER with the most semantically relevant standard ICD terms. The disease mentions extracted from EMRs and the standard disease names in ICD-10 were first encoded into lower-dimensional embeddings. The terms in ICD-10 are usually longer than the entities extracted from free text. Thus, we first applied Chinese word segmentation to each ICD-10 term and then transformed the segmented words into dense vectors, which were weight-averaged to derive the final embedding of the ICD-10 term. To improve the quality of the resulting representations, the principal component (derived by performing PCA on the word embedding matrix) was subtracted from the word embeddings in order to eliminate the dominating directions and the mean values of the vector space, thus making the embeddings stronger [34]. We then measured the cosine similarity between the dense vectors of the extracted entities and those of the ICD-10 terms. To define the best cut-off threshold for the similarity measure, we created ground-truth data by aligning 6795 of the 7584 terms detected by NER in the previous step to ICD-10 names (the remaining 789 were left unaligned). We automated a testing procedure to evaluate our trained model on these ground-truth data and selected the threshold (0.68) that yielded the optimal F1-score.
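The embedding steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: random vectors stand in for trained word2vec embeddings, the vocabulary items and averaging weights are hypothetical, and removing exactly one principal component is an assumption, since the text does not state how many were removed or how the weights were chosen.

```python
import numpy as np


def remove_top_components(embeddings, n_components=1):
    """Centre the embedding matrix and project out its top principal directions."""
    x = np.asarray(embeddings, dtype=float)
    centred = x - x.mean(axis=0)  # removes the mean of the vector space
    # Rows of vt are the principal directions of the centred matrix.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    top = vt[:n_components]
    return centred - centred @ top.T @ top


def weighted_average(word_vectors, weights):
    """Weighted mean of the word vectors of one segmented ICD-10 term."""
    v = np.asarray(word_vectors, dtype=float)
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * v).sum(axis=0) / w.sum()


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy data: random vectors stand in for trained word2vec embeddings.
rng = np.random.default_rng(0)
vocab = ["w1", "w2", "w3", "w4", "w5"]  # hypothetical segmented words
cleaned = remove_top_components(rng.normal(size=(len(vocab), 8)))
vector_of = dict(zip(vocab, cleaned))

# An ICD-10 term segmented into three words, with hypothetical weights.
icd_embedding = weighted_average(
    [vector_of[w] for w in ("w1", "w2", "w3")], weights=[0.5, 0.3, 0.2]
)
# A shorter extracted entity, here treated as a single vocabulary item.
entity_embedding = vector_of["w2"]
score = cosine_similarity(entity_embedding, icd_embedding)
```

In practice the vectors would come from the trained word representation model and the words from a Chinese word segmenter; only the arithmetic of averaging, component removal and cosine scoring is shown here.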
The disease name in ICD-10 with the highest similarity score above the adjusted threshold of 0.68 was selected as the normalisation outcome. More detailed explanations of the above two sections can be found in the Supplementary Information.
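The selection rule and the threshold search can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the mentions, ICD-10 labels, similarity scores and candidate thresholds are invented toy values, and the way true positives, false positives and false negatives are counted is one plausible reading of "F1-score" for this task.

```python
def normalise(similarities, threshold):
    """Return the ICD-10 term with the highest similarity, or None if it is
    not above the threshold."""
    best_term = max(similarities, key=similarities.get)
    return best_term if similarities[best_term] > threshold else None


def f1_at(threshold, scored, gold):
    """F1-score of the normalisation rule at one threshold.

    Assumed counting: a correct non-empty prediction is a true positive, any
    other non-empty prediction is a false positive, and a mention whose gold
    ICD-10 term was not recovered is a false negative.
    """
    tp = fp = fn = 0
    for mention, similarities in scored.items():
        predicted = normalise(similarities, threshold)
        truth = gold.get(mention)  # None marks an unaligned mention
        if predicted is not None and predicted == truth:
            tp += 1
        else:
            if predicted is not None:
                fp += 1
            if truth is not None:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored, gold, candidates):
    """Candidate threshold with the highest F1-score on the ground truth."""
    return max(candidates, key=lambda t: f1_at(t, scored, gold))


# Hypothetical cosine similarities of three mentions against two ICD-10 terms.
scored = {
    "mention_1": {"ICD_A": 0.90, "ICD_B": 0.40},
    "mention_2": {"ICD_A": 0.30, "ICD_B": 0.75},
    "mention_3": {"ICD_A": 0.55, "ICD_B": 0.20},
}
# Ground truth; mention_3 has no ICD-10 counterpart (unaligned).
gold = {"mention_1": "ICD_A", "mention_2": "ICD_B", "mention_3": None}
candidates = [0.3, 0.5, 0.7, 0.8]

chosen = best_threshold(scored, gold, candidates)
```

On this toy data the lower thresholds wrongly map the unaligned mention to an ICD-10 term, while the highest one drops a correct match, which is the trade-off the F1-based search balances.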

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
