The medical informatics community has also extensively studied relation extraction, both in shared tasks and in independently motivated research. For example, significant advances in extracting semantic relations from narrative text in EMRs were documented in the 2010 i2b2/VA challenge (i2b2: Informatics for Integrating Biology and the Bedside; VA: Veterans Affairs) [1].
The challenge comprised three tasks, concept extraction, assertion classification and relation classification, and attracted numerous international teams [1]. Concept extraction can be considered the basic task, as assertions and relations both refer to the extracted concepts. Because the challenge allowed relation classification to use the ground truth of concept extraction, the reported metrics for relation classification should be interpreted as an upper bound on end-to-end relation extraction performance (as in the BioNLP, BioCreative and DDIExtraction challenges). In this section, we review only the relation classification systems, where the target relations are predefined among medical problems, tests and treatments. These relations include ‘treatment improves / worsens / causes / is administered for / is not administered because of medical problem', ‘test reveals / conducted to investigate medical problem' and ‘medical problem indicates medical problem'. As with the above challenges, we review only those systems that represented sentences as graphs and explored such graphs during feature generation.
Roberts et al. [93] classified the semantic relations using a rather comprehensive set of features: context features (e.g. n-grams, GENIA part-of-speech tags surrounding medical concepts), nested relation features (relations in the text span between candidate pairs of concepts), single concept features (e.g. covered words and concept types), Wikipedia features (e.g. concepts matching Wikipedia titles), concept bigram features and similarity features. The latter were computed using edit distance on language constructs including GENIA phrase chunks and Stanford Dependencies shortest paths. Their system reached the highest f-measure on relation classification (0.737). de Bruijn et al. [94] applied a maximum entropy classifier with downsampling to balance the relation distribution. They applied the McClosky-Charniak-Johnson parser/Stanford Dependencies pipeline and included as features the dependency paths between the minimal trees covering the concept pairs. They also used word clusters as features to address the problem of unseen words. Their system reached the second-best f-measure of 0.731. Solt et al. [96] experimented with several parsers, including the Stanford Parser, the McClosky-Charniak-Johnson Parser and the Enju Parser. They fed the resulting dependency graphs to two graph kernels, the all-paths graph (APG) kernel [20] and kBSPS [99], which produced only moderate performance. This likely reflects the difficulty of tuning graph/tree kernel-based systems, consistent with observations from relation/event extraction from the scientific literature.
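To illustrate the shortest dependency path features used by these systems, the following sketch finds the path between two candidate concepts over a toy dependency graph. The sentence, edge triples and label set are hypothetical examples, not taken from any cited system; the graph is treated as undirected for the search, a common choice for such features.

```python
from collections import deque

# Hypothetical dependency parse of "Aspirin reduces the risk of stroke",
# as (head, dependent, label) triples.
EDGES = [
    ("reduces", "Aspirin", "nsubj"),
    ("reduces", "risk", "dobj"),
    ("risk", "the", "det"),
    ("risk", "stroke", "nmod"),
    ("stroke", "of", "case"),
]

def shortest_dep_path(edges, source, target):
    """BFS over the undirected dependency graph; returns the alternating
    token/label sequence connecting the two concepts, or None."""
    adj = {}
    for head, dep, label in edges:
        adj.setdefault(head, []).append((dep, label))
        adj.setdefault(dep, []).append((head, label))
    queue = deque([(source, [source])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for nxt, label in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label, nxt]))
    return None

# Path feature between the two candidate concepts:
print(shortest_dep_path(EDGES, "Aspirin", "stroke"))
# → ['Aspirin', 'nsubj', 'reduces', 'dobj', 'risk', 'nmod', 'stroke']
```

The resulting path string (or its edit distance to other paths, as in the similarity features above) can then be fed to a classifier.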
The SemEval 2015 Task 14 included disorder identification and disorder slot filling tasks [155]. Disorder identification is essentially named entity detection, and disorder slot filling is similar to the BioNLP event extraction tasks but in the clinical subdomain. The challenge further divided the slot filling task into two subtasks, one with gold-standard disorder spans (task 2a) and one without (task 2b); task 2b is thus evaluated more strictly than task 2a. The attribute slots defined by the challenge include concept unique identifier (CUI), negation (NEG), subject (SUB), uncertainty (UNC), course (COU), severity (SEV), conditional (CND), generic (GEN) and body location (BL). Identifying the CUI is the named entity-detection problem, and identifying negation and uncertainty is the assertion classification problem. Identifying SUB, COU, SEV, CND, GEN and BL is more analogous to binary relation extraction, though not completely equivalent to it, as the challenge limited the possible values for those slots, adding a layer of abstraction.
The challenge used weighted accuracy to rank the participants. Xu et al. [97] and Pathak et al. [98] consistently ranked as the top two teams in both task 2a (0.886 and 0.880, respectively) and task 2b (0.808 and 0.795, respectively). Xu et al. used a Conditional Random Field (CRF) as the classifier for BL slot filling and SVMs for the other slots. The SVM classifiers additionally used dependencies coming into and out of the disorder mentions. Such features cannot capture multi-hop syntactic dependence, but the authors observed that NEG/UNC/COU/SEV/GEN always exhibit one-hop dependence. The CRF (for BL), on the other hand, is itself a graph-based model that treats tokens and hidden states as nodes (integrating semantic and syntactic features including n-grams, context words, dictionaries and section names) and interconnects nodes with transition and emission edges [156]. Pathak et al. divided slot detection into two parts: detecting keywords and relating keywords to disorder mentions. They used dictionary look-up combined with a CRF trained on features such as bag-of-words and orthographic features to detect keywords. To relate keywords to disorder mentions, they trained SVMs using features similar to Xu et al.'s plus part-of-speech tags. Other teams used explicit graph-mining algorithms [157, 159] but did not perform as competitively. For example, Hakala et al. [157] tackled task 2a by adapting the TEES system to the SemEval data format and achieved a weighted accuracy of 0.857, placing third. This is not surprising: given that many slots involve only one-hop dependencies, a full-fledged graph-based approach offers limited benefit. In addition, the controlled vocabulary and controlled format of the challenge tasks make them well suited to CRFs, as a limited number of states and state transitions leads to less sparse and more robust probability estimation.
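The one-hop dependency features used for the SVM slots can be sketched as follows. The parse triples and feature-string format below are hypothetical illustrations, not the exact representation of either system; the idea is simply to pair each dependency into or out of a disorder mention with its relation label and neighboring word.

```python
def one_hop_features(edges, mention):
    """Collect dependencies coming into and out of the mention token,
    encoding each as a direction:label:neighbor feature string."""
    feats = []
    for head, dep, label in edges:
        if head == mention:
            feats.append(f"out:{label}:{dep}")
        if dep == mention:
            feats.append(f"in:{label}:{head}")
    return feats

# Hypothetical parse of "patient denies chest pain", as (head, dependent, label).
EDGES = [
    ("denies", "patient", "nsubj"),
    ("denies", "pain", "dobj"),
    ("pain", "chest", "compound"),
]

print(one_hop_features(EDGES, "pain"))
# → ['in:dobj:denies', 'out:compound:chest']
```

Here the `in:dobj:denies` feature carries the one-hop negation cue relevant to the NEG slot, consistent with the observation that such slots rarely require multi-hop paths.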
After the i2b2 challenges, several authors aimed at combining the concept and relation extraction steps into an integrated pipeline and/or generalizing to the extraction of complex or even nested relations. Xu et al. [95] developed a rule-based system, MedEx, to extract medications and specific relations between medications and their associated strengths, routes and frequencies. The MedEx system converts narrative sentences in clinical notes into conceptual graph representations of medication relations. To do so, Xu et al. designed a semantic grammar directly mappable to conceptual graphs and applied the Kay Chart Parser [160] to parse sentences according to this grammar. They also used a regular-expression-based chunker to capture medications missed by the Kay Chart Parser. Weng et al. [36] applied a customized syntactic parser to text specifying clinical eligibility criteria. They mined maximal frequent subtree patterns and manually aggregated and enriched them with the UMLS to form a semantic representation for eligibility criteria, which aims to enable semantically meaningful search queries over ClinicalTrials.gov. Luo et al. [49] augmented the Stanford Parser with UMLS-based concept recognition to accurately generate graph representations for sentences in pathology reports, where the graph nodes correspond to medical concepts. Frequent subgraph mining was then used to collect important semantic relations between medical concepts (e.g. which antigens are expressed on neoplastic lymphoid cells), which serve as the basis for classifying lymphoma subtypes. Extending the subgraph-based feature generation into unsupervised learning, Luo et al. [50] further used tensor factorization to group subgraphs. The intuition is that each subgraph corresponds to a test result, and a subgraph group represents a panel of test results, as typically used in diagnostic guidelines. The tensors incorporated three dimensions: patients, common subgraphs and individual words in each report.
The word dimension helped group the subgraphs so as to better recover lymphoma subtype diagnostic criteria.
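A minimal sketch of this kind of three-way factorization is given below, using plain CP (CANDECOMP/PARAFAC) decomposition by alternating least squares on a patients x subgraphs x words tensor. This is a generic textbook algorithm, not the authors' exact method; the unfolding conventions follow numpy's row-major reshape, and the tensor here is synthetic.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product used in the CP-ALS normal equations."""
    R = B.shape[1]
    return np.einsum("jr,kr->jkr", B, C).reshape(-1, R)

def cp_als(X, rank, n_iter=200, seed=0):
    """CP decomposition of a 3-way tensor X by alternating least squares.
    Returns factor matrices A (patients), B (subgraphs), C (words) such
    that X[i, j, k] ≈ sum_r A[i, r] * B[j, r] * C[k, r]."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, rank))
    B = rng.random((J, rank))
    C = rng.random((K, rank))
    X1 = X.reshape(I, J * K)                      # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)   # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)   # mode-3 unfolding
    for _ in range(n_iter):
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Synthetic low-rank tensor: 4 patients x 5 subgraphs x 6 words.
rng = np.random.default_rng(0)
X = np.einsum("ir,jr,kr->ijk",
              rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2)))
A, B, C = cp_als(X, rank=2)
# Large entries sharing a column of B indicate subgraphs (test results)
# that co-occur across patients, i.e. a candidate panel.
```

On an exactly low-rank tensor the reconstruction error drops to near zero; on real count tensors one would typically add nonnegativity constraints and choose the rank by validation.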