Update README.md
See [Zhang et al., 2021](https://arxiv.org/abs/2112.07887) for the details.

Note that some prior systems like [BioSyn](https://aclanthology.org/2020.acl-main.335.pdf), [SapBERT](https://aclanthology.org/2021.naacl-main.334.pdf), and their follow-up work (e.g., [Lai et al., 2021](https://aclanthology.org/2021.findings-emnlp.140.pdf)) claimed to do entity linking, but their systems completely ignore the context of an entity mention and can only predict a surface form in the entity dictionary (see Figure 1 in [BioSyn](https://aclanthology.org/2020.acl-main.335.pdf)), _**not the canonical entity ID (e.g., CUI in UMLS)**_. Therefore, they can't disambiguate ambiguous mentions. For instance, given the entity mention "_ER_" in the sentence "*ER crowding has become a wide-spread problem*", their systems ignore the sentence context and simply predict the closest surface form, which is just "ER". Multiple entities share this surface form as a potential name or alias, such as *Emergency Room (C0562508)*, *Estrogen Receptor Gene (C1414461)*, and *Endoplasmic Reticulum (C0014239)*. Without the context information, their systems can't resolve this ambiguity and pinpoint the correct entity, *Emergency Room (C0562508)*. More problematically, their evaluation would deem such an ambiguous prediction correct. Consequently, the results reported in their papers do not reflect true performance on entity linking.
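
To make the ambiguity concrete, here is a minimal Python sketch. The alias table contains only the three entities named above. `link_by_surface_form`, `link_with_context`, and the hint words are hypothetical, and the keyword-overlap scoring is a toy stand-in for a contextual encoder. It is not how KRISSBERT or the cited systems are implemented.

```python
# Toy alias dictionary built from the three UMLS entities named above.
ALIASES = {
    "ER": ["C0562508", "C1414461", "C0014239"],
}

NAMES = {
    "C0562508": "Emergency Room",
    "C1414461": "Estrogen Receptor Gene",
    "C0014239": "Endoplasmic Reticulum",
}

# Hypothetical context keywords per entity (illustration only).
CONTEXT_HINTS = {
    "C0562508": {"crowding", "patients", "hospital", "triage"},
    "C1414461": {"breast", "cancer", "expression", "gene"},
    "C0014239": {"protein", "folding", "stress", "membrane"},
}


def link_by_surface_form(mention):
    """Return every CUI whose alias equals the mention; context is ignored."""
    return ALIASES.get(mention, [])


def link_with_context(mention, sentence):
    """Pick the candidate whose hint words overlap most with the sentence."""
    tokens = set(sentence.lower().split())
    candidates = link_by_surface_form(mention)
    if not candidates:
        return None
    return max(candidates, key=lambda cui: len(CONTEXT_HINTS[cui] & tokens))


sentence = "ER crowding has become a wide-spread problem"
print(link_by_surface_form("ER"))  # several candidates, no decision
best = link_with_context("ER", sentence)
print(best, NAMES[best])
```

A surface-form match alone returns all three candidates. Only the second function, which looks at the sentence, can commit to a single CUI.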

## Usage of KRISSBERT for Entity Linking

Here, we use the [MedMentions](https://github.com/chanzuckerberg/MedMentions) data to show you how to 1) **generate prototype embeddings**, and 2) **run entity linking**.

(We are currently unable to release the self-supervised mention examples, because they require the UMLS and PubMed licenses.)


#### 1. Create conda environment and install requirements
```bash
conda create -n kriss -y python=3.8 && conda activate kriss
pip install -r requirements.txt
```


#### 2. Switch the root dir to [usage](https://huggingface.co/microsoft/BiomedNLP-KRISSBERT-PubMed-UMLS-EL/tree/main/usage)
```bash
cd usage
```


#### 3. Download the MedMentions dataset

```bash
git clone https://github.com/chanzuckerberg/MedMentions.git
```
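
To my understanding, MedMentions is distributed in PubTator format: a title line (`PMID|t|...`), an abstract line (`PMID|a|...`), then one tab-separated line per mention. The sketch below parses one such document from an in-memory string. The sample record is made up for illustration (placeholder PMID and semantic type), `parse_pubtator` is a hypothetical helper that is not part of this repository, and the offset convention is an assumption. Check the MedMentions README for the exact column layout and file locations.

```python
from typing import Dict, List

# Made-up record for illustration; PMID and semantic type are placeholders.
SAMPLE = (
    "00000000|t|ER crowding has become a wide-spread problem\n"
    "00000000|a|Hospitals report longer waiting times.\n"
    "00000000\t0\t2\tER\tT000\tC0562508\n"
)


def parse_pubtator(text: str) -> List[Dict]:
    """Parse PubTator-formatted text into a list of mention records."""
    passages: Dict[str, Dict[str, str]] = {}
    mentions: List[Dict] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a"):
            passages.setdefault(head[0], {})[head[1]] = head[2]
            continue
        cols = line.split("\t")
        if len(cols) != 6:
            continue  # skip anything that is not a mention line
        pmid, start, end, surface, sem_types, cui = cols
        doc = passages.get(pmid, {})
        # Assumption: offsets index into title and abstract joined by one character.
        full_text = doc.get("t", "") + " " + doc.get("a", "")
        mentions.append({
            "pmid": pmid,
            "start": int(start),
            "end": int(end),
            "mention": surface,
            "types": sem_types.split(","),
            "cui": cui.replace("UMLS:", ""),  # drop the prefix if present
            "context": full_text,
        })
    return mentions


for m in parse_pubtator(SAMPLE):
    print(m["mention"], m["cui"], repr(m["context"][m["start"]:m["end"]]))
```

Each record keeps the gold CUI next to the mention span and its surrounding text, which is the information a context-aware linker needs.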

#### 4. Generate prototype embeddings
```bash
python generate_prototypes.py
```
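
Conceptually, a prototype embedding is a vector for an example mention in context, stored together with its entity ID. The sketch below illustrates that idea with a hashed bag-of-words encoder as a stand-in. It is not the KRISSBERT encoder and does not reproduce the logic of `generate_prototypes.py`. `toy_encode`, `build_prototypes`, and the example sentences are assumptions made for illustration.

```python
import hashlib
import math
from typing import Dict, List, Tuple

DIM = 16  # toy dimensionality


def toy_encode(text: str) -> List[float]:
    """Stand-in encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # md5 keeps the bucket stable across runs (built-in hash() is salted).
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_prototypes(
    examples: List[Tuple[str, str]]
) -> Dict[str, List[List[float]]]:
    """Group one embedding per (context, CUI) example under its CUI."""
    prototypes: Dict[str, List[List[float]]] = {}
    for context, cui in examples:
        prototypes.setdefault(cui, []).append(toy_encode(context))
    return prototypes


# Made-up labelled contexts for three entities that share the alias "ER".
examples = [
    ("patients waited hours in the crowded ER", "C0562508"),
    ("ER overcrowding delays triage", "C0562508"),
    ("ER expression in breast cancer cells", "C1414461"),
    ("misfolded proteins accumulate in the ER lumen", "C0014239"),
]

prototypes = build_prototypes(examples)
for cui, vectors in prototypes.items():
    print(cui, len(vectors), "prototype(s) of dimension", len(vectors[0]))
```

Keeping several prototypes per entity, rather than a single vector, lets one entity be represented by more than one typical context.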

#### 5. Run entity linking
```bash
python run_entity_linking.py
```

This will give you about `58.3%` top-1 accuracy.
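
For reference, top-1 accuracy counts a mention as correct only when the highest-ranked entity equals the gold entity. The following sketch links each mention to the entity of its most similar prototype and computes that metric on hand-written toy vectors. The vectors, `link`, and `top1_accuracy` are illustrative assumptions. They do not reproduce `run_entity_linking.py` or the figure quoted above.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def link(query: List[float], prototypes: Dict[str, List[List[float]]]) -> str:
    """Return the CUI whose best prototype has the highest dot product."""
    return max(
        prototypes,
        key=lambda cui: max(dot(query, p) for p in prototypes[cui]),
    )


def top1_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


# Hand-written 2-d toy vectors, not real model output.
prototypes = {
    "C0562508": [[1.0, 0.0], [0.9, 0.1]],
    "C1414461": [[0.0, 1.0]],
}
queries = [
    ([0.8, 0.2], "C0562508"),
    ([0.1, 0.9], "C1414461"),
    ([0.6, 0.4], "C1414461"),  # deliberately closer to the wrong entity
]

pairs = [(link(vec, prototypes), gold) for vec, gold in queries]
print(f"top-1 accuracy: {top1_accuracy(pairs):.3f}")
```

Because every prediction here is a CUI rather than a surface form, an ambiguous string match cannot be counted as correct.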

## Citation

If you find KRISSBERT useful in your research, please cite the following paper:

```
...
  eprinttype = {arXiv},
  eprint = {2112.07887},
}
```

[https://arxiv.org/pdf/2112.07887.pdf](https://arxiv.org/pdf/2112.07887.pdf)