Commit 1f7bd2d (parent: e172865): Update README.md

README.md (CHANGED)
## DOCUMENTATION UPDATES IN PROGRESS

## Model description

**EpiExtract4GARD-v2** is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model that is ready to use for **Named Entity Recognition** of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). This model was fine-tuned on EpiSet4NER-v2 for epidemiological information from rare disease abstracts. See dataset documentation for details on the weakly supervised teaching methods and dataset biases and limitations. See [EpiExtract4GARD on GitHub](https://github.com/ncats/epi4GARD/tree/master/EpiExtract4GARD#epiextract4gard) for details on the entire pipeline.

#### How to use

You can use this model with the Hosted inference API to the right with this [test sentence](https://pubmed.ncbi.nlm.nih.gov/21659675/): "27 patients have been diagnosed with PKU in Iceland since 1947. Incidence 1972-2008 is 1/8400 living births."
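
For local use, a minimal sketch with the Hugging Face Transformers pipeline follows. The Hub identifier `ncats/EpiExtract4GARD-v2` is an assumption based on the model name; substitute the actual repository ID if it differs.

```python
# Minimal usage sketch for the NER model via the transformers pipeline API.
# "ncats/EpiExtract4GARD-v2" is an assumed Hub ID; adjust as needed.
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="ncats/EpiExtract4GARD-v2",
    aggregation_strategy="simple",  # merge B-/I- subword tags into whole entities
)

sentence = (
    "27 patients have been diagnosed with PKU in Iceland since 1947. "
    "Incidence 1972-2008 is 1/8400 living births."
)
for entity in nlp(sentence):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```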

[...]

B-EPI | Beginning of an epidemiologic type (e.g. "incidence", "prevalence", ...)
I-EPI | Epidemiologic type that is not the beginning token.
B-STAT | Beginning of an epidemiologic rate
I-STAT | Inside of an epidemiologic rate
More | Description pending
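
As an illustration of this BIO scheme, here is how the test sentence above would be labeled. These pairs are hand-annotated to match the tag definitions, not verified model output.

```python
# Illustrative (token, tag) pairs for the test sentence under the BIO scheme.
# Hand-labeled to show the scheme; not actual model predictions.
expected_tags = [
    ("PKU", "O"),            # disease names are not among the tagged entity types
    ("Iceland", "B-LOC"),    # location
    ("Incidence", "B-EPI"),  # epidemiologic type
    ("1/8400", "B-STAT"),    # start of the epidemiologic rate
    ("living", "I-STAT"),    # rate continues
    ("births", "I-STAT"),
]
```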

### EpiSet Statistics

Beyond any limitations of the EpiSet4NER dataset, this model's numeracy is limited by BERT-based models' reliance on subword embeddings; numeracy is crucial for identifying epidemiologic rates, so this limits the entity-level results. Recent numeracy techniques could improve the model's performance without improving the underlying dataset.
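
A small sketch of the subword issue: tokenizing the test sentence with the BioBERT tokenizer shows how a rate like "1/8400" is shattered into pieces whose embeddings carry little numeric meaning. The exact split shown in the comment is an expectation, not a verified output; it depends on the WordPiece vocabulary.

```python
# Why numeracy suffers: BioBERT's WordPiece vocabulary has no dedicated
# tokens for most numbers, so an epidemiologic rate is split into subword
# pieces that must each be tagged separately.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
print(tokenizer.tokenize("Incidence 1972-2008 is 1/8400 living births."))
# Expect "1/8400" to split into several pieces (e.g. ['1', '/', '84', '##00']).
```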

## Training procedure

This model was trained on an [AWS EC2 p3.2xlarge](https://aws.amazon.com/ec2/instance-types/) instance, which utilized a single Tesla V100 GPU, with these hyperparameters: 4 epochs of training (AdamW weight decay = 0.05) with a batch size of 16 and a maximum sequence length of 192. The model was fed one sentence at a time.
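
For reference, a hedged reconstruction of these settings as Transformers `TrainingArguments`; anything not stated above (output path, learning rate, optimizer schedule) is an assumption or a library default, and this is not the original training script.

```python
# Sketch of the reported hyperparameters as transformers TrainingArguments.
# Unstated values (e.g. learning rate) are left at library defaults.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="epiextract4gard-v2",  # hypothetical output path
    num_train_epochs=4,               # 4 epochs of training
    per_device_train_batch_size=16,   # batch size of 16 on one V100
    weight_decay=0.05,                # AdamW weight decay
)
# Sentences are tokenized individually with the reported length cap:
# tokenizer(sentence, truncation=True, max_length=192)
```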

<!--- Full config [here](https://wandb.ai/wzkariampuzha/huggingface/runs/353prhts/files/config.yaml). --->

<!--- THIS IS NOT THE UPDATED RESULTS --->

<!--- ## Hold-out validation results --->
<!--- metric | entity-level result --->
<!--- -|- --->
<!--- f1 | 83.8 --->
<!--- precision | 83.2 --->
<!--- recall | 84.5 --->

<!--- ## Test results --->
<!--- | Dataset for Model Training | Evaluation Level | Entity             | Precision | Recall | F1    | --->
<!--- |:--------------------------:|:----------------:|:------------------:|:---------:|:------:|:-----:| --->
<!--- | EpiSet                     | Entity-Level     | Overall            | 0.556     | 0.662  | 0.605 | --->
<!--- |                            |                  | Location           | 0.661     | 0.696  | 0.678 | --->
<!--- |                            |                  | Epidemiologic Type | 0.854     | 0.911  | 0.882 | --->
<!--- |                            |                  | Epidemiologic Rate | 0.143     | 0.218  | 0.173 | --->
<!--- |                            | Token-Level      | Overall            | 0.811     | 0.713  | 0.759 | --->
<!--- |                            |                  | Location           | 0.949     | 0.742  | 0.833 | --->
<!--- |                            |                  | Epidemiologic Type | 0.900     | 0.917  | 0.908 | --->
<!--- |                            |                  | Epidemiologic Rate | 0.724     | 0.636  | 0.677 | --->

Thanks to [@William Kariampuzha](https://github.com/wzkariampuzha) at Axle Informatics/NCATS for contributing this model.