wzkariampuzha committed 1f7bd2d (1 parent: e172865): Update README.md

Files changed (1): README.md (+27 −22)

README.md CHANGED
@@ -50,7 +50,7 @@ license: other
 ## DOCUMENTATION UPDATES IN PROGRESS

 ## Model description
- **EpiExtract4GARD** is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model that is ready to use for **Named Entity Recognition** of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). This model was fine-tuned on [EpiSet4NER](https://huggingface.co/datasets/ncats/EpiSet4NER) for epidemiological information from rare disease abstracts. See dataset documentation for details on the weakly supervised teaching methods and dataset biases and limitations. See [EpiExtract4GARD on GitHub](https://github.com/ncats/epi4GARD/tree/master/EpiExtract4GARD#epiextract4gard) for details on the entire pipeline.

 #### How to use
 You can use this model with the Hosted inference API to the right with this [test sentence](https://pubmed.ncbi.nlm.nih.gov/21659675/): "27 patients have been diagnosed with PKU in Iceland since 1947. Incidence 1972-2008 is 1/8400 living births."
@@ -116,32 +116,37 @@ B-EPI | Beginning of an epidemiologic type (e.g. "incidence", "prevalence", "
 I-EPI | Epidemiologic type that is not the beginning token.
 B-STAT | Beginning of an epidemiologic rate
 I-STAT | Inside of an epidemiologic rate

 ### EpiSet Statistics

- Beyond any limitations due to the EpiSet4NER dataset, this model is limited in numeracy due to BERT-based model's use of subword embeddings, which is crucial for epidemiologic rate identification and limits the entity-level results. Additionally, more recent weakly supervised learning techniques could be used to improve the performance of the model without improving the underlying dataset.

 ## Training procedure
 This model was trained on an [AWS EC2 p3.2xlarge](https://aws.amazon.com/ec2/instance-types/), which utilized a single Tesla V100 GPU, with these hyperparameters:
- 4 epochs of training (AdamW weight decay = 0.05) with a batch size of 16. Maximum sequence length = 192. Model was fed one sentence at a time. Full config [here](https://wandb.ai/wzkariampuzha/huggingface/runs/353prhts/files/config.yaml).
-
- ## Hold-out validation results
- metric | entity-level result
- -|-
- f1 | 83.8
- precision | 83.2
- recall | 84.5
-
- ## Test results
- | Dataset for Model Training | Evaluation Level | Entity | Precision | Recall | F1 |
- |:--------------------------:|:----------------:|:------------------:|:---------:|:------:|:-----:|
- | EpiSet | Entity-Level | Overall | 0.556 | 0.662 | 0.605 |
- | | | Location | 0.661 | 0.696 | 0.678 |
- | | | Epidemiologic Type | 0.854 | 0.911 | 0.882 |
- | | | Epidemiologic Rate | 0.143 | 0.218 | 0.173 |
- | | Token-Level | Overall | 0.811 | 0.713 | 0.759 |
- | | | Location | 0.949 | 0.742 | 0.833 |
- | | | Epidemiologic Type | 0.900 | 0.917 | 0.908 |
- | | | Epidemiologic Rate | 0.724 | 0.636 | 0.677 |

 Thanks to [@William Kariampuzha](https://github.com/wzkariampuzha) at Axle Informatics/NCATS for contributing this model.
 
 ## DOCUMENTATION UPDATES IN PROGRESS

 ## Model description
+ **EpiExtract4GARD-v2** is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model that is ready to use for **Named Entity Recognition** of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). This model was fine-tuned on EpiSet4NER-v2 for epidemiological information from rare disease abstracts. See the dataset documentation for details on the weakly supervised teaching methods and on the dataset's biases and limitations. See [EpiExtract4GARD on GitHub](https://github.com/ncats/epi4GARD/tree/master/EpiExtract4GARD#epiextract4gard) for details on the entire pipeline.

 #### How to use
 You can use this model with the Hosted inference API to the right with this [test sentence](https://pubmed.ncbi.nlm.nih.gov/21659675/): "27 patients have been diagnosed with PKU in Iceland since 1947. Incidence 1972-2008 is 1/8400 living births."
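Beyond the hosted widget, the model can be called programmatically; here is a minimal sketch using the `transformers` token-classification pipeline. The Hub model ID `ncats/EpiExtract4GARD-v2` is assumed from this repository; everything else is the standard pipeline API.

```python
# Sketch of programmatic inference with the transformers library.
# Assumption: the model is published on the Hugging Face Hub as
# "ncats/EpiExtract4GARD-v2" (this repository).
TEST_SENTENCE = (
    "27 patients have been diagnosed with PKU in Iceland since 1947. "
    "Incidence 1972-2008 is 1/8400 living births."
)

def extract_epi_entities(text, model_id="ncats/EpiExtract4GARD-v2"):
    # Deferred import: transformers is only needed when the helper is called.
    from transformers import pipeline
    ner = pipeline("token-classification", model=model_id,
                   aggregation_strategy="simple")
    # Returns a list of dicts with "entity_group" (LOC/EPI/STAT),
    # "word", and "score" keys.
    return ner(text)

# Example call (downloads model weights on first use):
# extract_epi_entities(TEST_SENTENCE)
```

Running the helper requires `transformers` with a PyTorch or TensorFlow backend installed.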
 
 I-EPI | Epidemiologic type that is not the beginning token.
 B-STAT | Beginning of an epidemiologic rate
 I-STAT | Inside of an epidemiologic rate
+ +More | Description pending
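After inference, the BIO tags in the table above can be collapsed into entity spans. A minimal, self-contained sketch (the token and tag lists here are hypothetical, for illustration only):

```python
def bio_to_spans(tokens, tags):
    """Collapse BIO tags (B-LOC/I-LOC, B-EPI/I-EPI, B-STAT/I-STAT, O)
    into (entity_type, surface_text) spans."""
    spans, open_span = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            open_span = (tag[2:], [token])    # open a new entity
            spans.append(open_span)
        elif tag.startswith("I-") and open_span and open_span[0] == tag[2:]:
            open_span[1].append(token)        # continue the open entity
        else:
            open_span = None                  # "O" or a stray I- tag closes it
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["Incidence", "in", "Iceland", "is", "1/8400", "living", "births"]
tags   = ["B-EPI", "O", "B-LOC", "O", "B-STAT", "I-STAT", "I-STAT"]
print(bio_to_spans(tokens, tags))
# → [('EPI', 'Incidence'), ('LOC', 'Iceland'), ('STAT', '1/8400 living births')]
```

With `aggregation_strategy="simple"`, the transformers pipeline performs an equivalent grouping internally; this sketch only makes the tag scheme concrete.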

 ### EpiSet Statistics

+ Beyond any limitations inherited from the EpiSet4NER dataset, this model is limited in numeracy because BERT-based models use subword embeddings, and numeracy is crucial for identifying epidemiologic rates; this limits the entity-level results. Recent techniques in numeracy could be used to improve the model's performance without further improving the underlying dataset.

 ## Training procedure
 This model was trained on an [AWS EC2 p3.2xlarge](https://aws.amazon.com/ec2/instance-types/), which utilized a single Tesla V100 GPU, with these hyperparameters:
+ 4 epochs of training (AdamW weight decay = 0.05) with a batch size of 16. Maximum sequence length = 192. The model was fed one sentence at a time.
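The hyperparameters above, collected into a plain mapping for quick reference (the key names are illustrative; they are not taken from the original training script or wandb config):

```python
# Training hyperparameters as reported in this card; key names are
# illustrative, not from the original training script.
hyperparams = {
    "epochs": 4,
    "optimizer": "AdamW",
    "weight_decay": 0.05,
    "train_batch_size": 16,
    "max_seq_length": 192,
    "input_granularity": "one sentence per example",
}
print(hyperparams["epochs"], hyperparams["max_seq_length"])  # → 4 192
```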
+
+ <!--- Full config [here](https://wandb.ai/wzkariampuzha/huggingface/runs/353prhts/files/config.yaml). --->
+
+ <!--- THESE ARE NOT THE UPDATED RESULTS --->
+
+ <!--- ## Hold-out validation results --->
+ <!--- metric | entity-level result --->
+ <!--- -|- --->
+ <!--- f1 | 83.8 --->
+ <!--- precision | 83.2 --->
+ <!--- recall | 84.5 --->
+
+ <!--- ## Test results --->
+ <!--- | Dataset for Model Training | Evaluation Level | Entity | Precision | Recall | F1 | --->
+ <!--- |:--------------------------:|:----------------:|:------------------:|:---------:|:------:|:-----:| --->
+ <!--- | EpiSet | Entity-Level | Overall | 0.556 | 0.662 | 0.605 | --->
+ <!--- | | | Location | 0.661 | 0.696 | 0.678 | --->
+ <!--- | | | Epidemiologic Type | 0.854 | 0.911 | 0.882 | --->
+ <!--- | | | Epidemiologic Rate | 0.143 | 0.218 | 0.173 | --->
+ <!--- | | Token-Level | Overall | 0.811 | 0.713 | 0.759 | --->
+ <!--- | | | Location | 0.949 | 0.742 | 0.833 | --->
+ <!--- | | | Epidemiologic Type | 0.900 | 0.917 | 0.908 | --->
+ <!--- | | | Epidemiologic Rate | 0.724 | 0.636 | 0.677 | --->

  Thanks to [@William Kariampuzha](https://github.com/wzkariampuzha) at Axle Informatics/NCATS for contributing this model.