BioWiC [35] instances follow a format similar to that of WiC: each instance consists of a pair of biomedical terms (w1 and w2) and their corresponding sentences (s1 and s2). The task is to classify each instance as True if the target terms carry the same meaning in both sentences and as False otherwise. We represent each instance as a tuple pair t = [(s1,w1),(s2,w2)]: y, where w1 and w2 are the target terms, s1 and s2 are the corresponding sentences, and y is the associated binary label. Table 2 presents some examples of BioWiC instances. In contrast to WiC, where the two target terms of an instance always share the same lemma, BioWiC allows the two target terms to be identical terms, an abbreviation and its expansion, synonyms, or terms with similar surface forms.
Table 2. BioWiC [35] instances, drawn from the test split. The target terms of each instance are in bold.
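The instance format t = [(s1,w1),(s2,w2)]: y described above can be sketched as a small data structure. This is a minimal illustration only: the class name, field names, and the two example sentences are assumptions made here for readability, and the sentences are invented rather than taken from the dataset.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BioWiCInstance:
    """One instance t = [(s1, w1), (s2, w2)]: y (names are illustrative)."""

    s1: str  # first sentence
    w1: str  # target term occurring in s1
    s2: str  # second sentence
    w2: str  # target term occurring in s2
    y: bool  # True if w1 and w2 carry the same meaning, else False

    def as_tuple_pair(self) -> Tuple[Tuple[Tuple[str, str], Tuple[str, str]], bool]:
        """Return the instance in the ((s1, w1), (s2, w2)), y layout."""
        return ((self.s1, self.w1), (self.s2, self.w2)), self.y


# Invented sentences, used only to show the layout (not dataset content).
example = BioWiCInstance(
    s1="The delivery was uncomplicated and the infant was healthy.",
    w1="delivery",
    s2="Nanoparticles can improve drug delivery to tumour tissue.",
    w2="delivery",
    y=False,
)

pair, label = example.as_tuple_pair()
```

Keeping the two (sentence, term) tuples separate from the label mirrors the notation in the text and makes it straightforward to hand the pair to a classifier while holding the label back for evaluation.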
To evaluate challenging scenarios for semantic representation, such as synonymy, polysemy, and abbreviations, BioWiC [35] is divided into four main groups of instances. Group A (term identity) contains instances where the target terms w1 and w2 are identical. In group B (abbreviations), one of w1 and w2 could be an abbreviation of the other. Group C (synonyms) consists of instances where w1 and w2 could be synonyms according to UMLS. Lastly, group D (label similarity) includes instances where w1 and w2 share similar surface forms. We employed the following five steps to generate the BioWiC instances:
For clarity, Fig. 2 illustrates how BioWiC [35] instances are built for the target term “delivery”. First, we preprocess the resource data and extract all sentences in which “delivery” is linked to UMLS. We convert each sentence into a sentence-term tuple (si,w), where si is a sentence containing the term w = “delivery”. Next, we pair the tuples (si,w) identified in the preceding step in all possible combinations to generate BioWiC instances t = [(si,w),(sj,w)], in which “delivery” serves as the target term in both sentences. Finally, we label each instance as True when “delivery” is mapped to the same CUI in both sentences and as False when it is not.
Fig. 2. The overall pipeline of the BioWiC [35] construction process. Step 1: Pre-process the source documents to a consistent format. Step 2: Identify and retrieve sentences including the term “delivery” linked to UMLS. Step 3: Pair the retrieved sentences to generate BioWiC instances. In Step 3, the green box shows an example of a BioWiC instance with the same target concept, while the red boxes show examples of different target concepts.
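The pairing and labelling steps of the “delivery” example can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the `Mention` type, the `build_instances` helper, the placeholder sentences, and the placeholder concept identifiers (`CUI_X`, `CUI_Y`) are all assumptions introduced here. It also assumes unordered pairs, since the text does not state whether (si, sj) and (sj, si) are counted separately.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    """A sentence-term tuple (si, w) plus the concept the term is linked to."""

    sentence: str
    term: str
    cui: str  # placeholder identifier, not a real UMLS CUI


Instance = Tuple[Tuple[Tuple[str, str], Tuple[str, str]], bool]


def build_instances(mentions: List[Mention]) -> List[Instance]:
    """Pair every two mentions and label the pair by concept identity."""
    instances = []
    # Assumption: each unordered pair of mentions is used exactly once.
    for a, b in combinations(mentions, 2):
        label = a.cui == b.cui  # True only if both link to the same concept
        instances.append((((a.sentence, a.term), (b.sentence, b.term)), label))
    return instances


# Placeholder mentions of the same surface term linked to two concepts.
mentions = [
    Mention("Sentence A mentioning delivery.", "delivery", "CUI_X"),
    Mention("Sentence B mentioning delivery.", "delivery", "CUI_X"),
    Mention("Sentence C mentioning delivery.", "delivery", "CUI_Y"),
]

instances = build_instances(mentions)
labels = [label for _, label in instances]
```

With three mentions this yields three instances: the pair that shares a concept is labelled True, and the two pairs that cross concepts are labelled False.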