---
license: apache-2.0
---

<h2>GatorTronS overview</h2>

Developed through a joint effort between the University of Florida and NVIDIA, GatorTronS is a clinical language model with 345 million parameters, pre-trained using a BERT architecture implemented in the Megatron package (https://github.com/NVIDIA/Megatron-LM).

GatorTronS is pre-trained using a dataset consisting of:

- 22B synthetic clinical words generated by GatorTronGPT (a Megatron GPT-3 model)
- 6.1B words from PubMed CC0
- 2.5B words from WikiText
- 0.5B words of de-identified clinical notes from MIMIC-III

The GitHub repository for GatorTronGPT is available at: https://github.com/uf-hobi-informatics-lab/GatorTronGPT

This model was converted to the Hugging Face format from: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_s

<h2>Description of the 22B-word synthetic clinical text</h2>

We sampled the first 15 tokens from all sections of the de-identified notes in the MIMIC-III database, yielding approximately 8 million prompts. We also used several random seeds in GatorTronGPT to generate multiple documents from each prompt, and constrained generation to a maximum length of 512 tokens. Applying GatorTronGPT in this way, we generated a total of 22 billion words of synthetic clinical text. Detailed information is provided in the GatorTronGPT paper: https://arxiv.org/abs/2305.13523
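
For illustration only, the minimal sketch below mirrors this prompting setup with the Hugging Face `generate` API. The checkpoint name is a placeholder (GatorTronGPT itself is not distributed with this card), and the sampling settings are assumptions rather than the exact configuration used to build the corpus.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: substitute any Hugging Face causal LM to try the procedure.
MODEL_NAME = "your-org/your-causal-lm"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# A (de-identified) note section; only its first 15 tokens are kept as the prompt.
note_section = "Bone scan:  Negative for distant metastasis."
prompt_ids = tokenizer(note_section, return_tensors="pt").input_ids[:, :15]

# Generate several documents from the same prompt by varying the random seed.
for seed in (0, 1, 2):
    torch.manual_seed(seed)
    output_ids = model.generate(
        prompt_ids,
        do_sample=True,
        max_length=512,  # maximum generation length used for GatorTronGPT
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```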


<h2>Model variations</h2>

Model | Parameters
--- | --- 
[gatortron-base](https://huggingface.co/UFNLP/gatortron-base)| 345 million 
[gatortronS (this model)](https://huggingface.co/UFNLP/gatortronS) | 345 million
[gatortron-medium](https://huggingface.co/UFNLP/gatortron-medium) | 3.9 billion 
[gatortron-large](https://huggingface.co/UFNLP/gatortron-large) | 8.9 billion

<h2>How to use</h2>

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Load the GatorTronS tokenizer, configuration, and encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('UFNLP/gatortronS')
config = AutoConfig.from_pretrained('UFNLP/gatortronS')
model = AutoModel.from_pretrained('UFNLP/gatortronS')

# Encode a clinical sentence and run it through the encoder
encoded_input = tokenizer("Bone scan:  Negative for distant metastasis.", return_tensors="pt")
encoded_output = model(**encoded_input)
print(encoded_output)
```
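
The output above is the encoder's raw output. As a possible follow-up (not part of the original card), token embeddings and a simple pooled sentence vector can be derived from `last_hidden_state`; the mean pooling shown below is one common convention, not an official recommendation for GatorTronS.

```python
# Continues from the example above: `encoded_input` and `encoded_output`
# come from the tokenizer and model calls in the previous block.
token_embeddings = encoded_output.last_hidden_state         # (batch, seq_len, hidden_size)

# One common (but not prescribed) way to obtain a single sentence vector:
# mean-pool token embeddings over non-padding positions.
mask = encoded_input["attention_mask"].unsqueeze(-1)         # (batch, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                              # torch.Size([1, hidden_size])
```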

- An NLP package using GatorTronS for clinical concept extraction (Named Entity Recognition): https://github.com/uf-hobi-informatics-lab/ClinicalTransformerNER (a generic token-classification sketch follows this list)
- An NLP package using GatorTronS for relation extraction: https://github.com/uf-hobi-informatics-lab/ClinicalTransformerRelationExtraction
- An NLP package using GatorTronS for extraction of social determinants of health (SDoH) from clinical narratives: https://github.com/uf-hobi-informatics-lab/SDoH_SODA
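
The packages above implement their own training pipelines; the sketch below is only a generic illustration of loading GatorTronS as the encoder behind a standard `transformers` token-classification head. The label set is a made-up placeholder, and the classification head is randomly initialized until fine-tuned on annotated clinical notes.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder label scheme; real clinical NER schemas will differ.
labels = ["O", "B-PROBLEM", "I-PROBLEM", "B-TREATMENT", "I-TREATMENT"]

tokenizer = AutoTokenizer.from_pretrained("UFNLP/gatortronS")
model = AutoModelForTokenClassification.from_pretrained(
    "UFNLP/gatortronS",
    num_labels=len(labels),  # must match your annotation scheme
)

# Per-token label logits from the (still untrained) classification head.
inputs = tokenizer("Bone scan:  Negative for distant metastasis.", return_tensors="pt")
logits = model(**inputs).logits              # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)
print([labels[i] for i in predictions[0].tolist()])
```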

<h2>Citation info</h2>

Peng C, Yang X, Chen A, Smith KE, PourNejatian N, Costa AB, Martin C, Flores MG, Zhang Y, Magoc T, Lipori G, Mitchell DA, Ospina NS, Ahmed MM, Hogan WR, Shenkman EA, Guo Y, Bian J, Wu Y†. A Study of Generative Large Language Model for Medical Research and Healthcare. 2023; https://arxiv.org/abs/2305.13523.

- BibTeX entry
```
@ARTICLE{Peng2023-sm,
   title         = "A study of generative large language model for medical
                    research and healthcare",
   author        = "Peng, Cheng and Yang, Xi and Chen, Aokun and Smith, Kaleb E
                    and PourNejatian, Nima and Costa, Anthony B and Martin,
                    Cheryl and Flores, Mona G and Zhang, Ying and Magoc, Tanja
                    and Lipori, Gloria and Mitchell, Duane A and Ospina, Naykky
                    S and Ahmed, Mustafa M and Hogan, William R and Shenkman,
                    Elizabeth A and Guo, Yi and Bian, Jiang and Wu, Yonghui",
   month         =  may,
   year          =  2023,
   copyright     = "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
   archivePrefix = "arXiv",
   primaryClass  = "cs.CL",
   eprint        = "2305.13523"
}
```

<h2>Contact</h2>

- Yonghui Wu: yonghui.wu@ufl.edu
- Cheng Peng: c.peng@ufl.edu