|
--- |
|
license: cc-by-nc-4.0 |
|
library_name: clmbr |
|
tags: |
|
- healthcare |
|
- femr |
|
- medical |
|
extra_gated_prompt: "You agree to all terms outlined in 'The EHRSHOT Credentialed Health Data License' (see https://shahlab.stanford.edu/ehrshot_license). Access requires a verified CITI training certificate, using the same process outlined by PhysioNet (see https://physionet.org/about/citi-course/). Please provide proof via the verification URL, which takes the form https://www.citiprogram.org/verify/?XXXXXX. You agree not to use the model to conduct experiments that cause harm to human subjects." 
|
extra_gated_fields: |
|
Full Name: text |
|
Email: text |
|
Affiliation: text |
|
CITI Certification Verification URL: text |
|
I agree to all terms outlined in 'The EHRSHOT Credentialed Health Data License': checkbox |
|
I agree to use this model for non-commercial use ONLY: checkbox |
|
--- |
|
|
|
# CLMBR-T-Base |
|
|
|
This is a 141 million parameter autoregressive foundation model pretrained on 2.57 million deidentified EHRs from Stanford Medicine. |
|
|
|
This is the model from [(Wornow et al. 2023)](https://arxiv.org/abs/2307.02028), and it is based on the CLMBR architecture originally described in [(Steinberg et al. 2021)](https://www.sciencedirect.com/science/article/pii/S1532046420302653).
|
|
|
As input, this model expects a sequence of coded medical events that have been mapped to Standard Concepts within the [OMOP-CDM vocabulary](https://ohdsi.github.io/CommonDataModel/index.html). The model generates representations of patients which can then be used for downstream prediction tasks. |
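
For orientation, the expected input is a chronologically ordered list of timestamped, coded events per patient. The sketch below is illustrative only (the codes and identifiers are placeholders) and mirrors the structure used in the quick-start example further down.

```python
import datetime

# Illustrative patient record: a timeline of timestamped, OMOP-coded events.
# The "SNOMED/..." codes and the patient_id are placeholders; real inputs use
# Standard Concept codes drawn from your OMOP-CDM extract.
example_patient = {
    "patient_id": 42,
    "events": [
        {
            "time": datetime.datetime(2015, 3, 1),
            "measurements": [{"code": "SNOMED/1"}],
        },
        {
            "time": datetime.datetime(2016, 7, 12),
            "measurements": [{"code": "SNOMED/30"}, {"code": "SNOMED/103"}],
        },
    ],
}
```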
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
- **Developed by:** Shah lab @ Stanford University |
|
- **Funded by:** Stanford Health Care
|
- **Shared by:** Shah lab @ Stanford University |
|
- **Model type:** CLMBR [(Steinberg et al. 2021)](https://www.sciencedirect.com/science/article/pii/S1532046420302653) |
|
- **Language(s) (NLP):** Electronic health record codes |
|
- **License:** CC BY-NC 4.0
|
- **Finetuned from model:** N/A -- trained from scratch |
|
|
|
### Model Sources |
|
|
|
- **Website:** [https://ehrshot.stanford.edu/](https://ehrshot.stanford.edu/) |
|
- **GitHub:** [https://github.com/som-shahlab/ehrshot-benchmark/](https://github.com/som-shahlab/ehrshot-benchmark/)
|
- **Paper:** [EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models](https://arxiv.org/abs/2307.02028) |
|
|
|
## Uses |
|
|
|
This model is intended to generate representations for patients based on the structured data within their electronic health record. |
|
These representations can then be used for downstream tasks such as predicting diagnoses, detecting anomalies, or doing propensity score matching for causal inference. |
|
|
|
### Direct Use |
|
|
|
You will likely want to tune the model for your downstream use case. |
|
|
|
### Out-of-Scope Use |
|
|
|
This model is intended for research purposes only. It must not be used for any real-world decision-making that impacts patients, providers, or hospital operations.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
This model was trained on a corpus of 2.57 million patients from Stanford Medicine. |
|
The model will thus reflect the patterns of how care is delivered at Stanford Medicine, in addition to the racial and socioeconomic makeup of Stanford Medicine's patient base. |
|
This model may not generalize well to other hospitals and demographic mixes. |
|
|
|
While this is technically a generative model, we have not tested its generative capabilities, and we do not intend for it to be used to generate synthetic EHR records.

We aim to explore its generative capabilities in future work.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
First, install the necessary libraries.
|
```bash |
|
# Create Python 3.10 environment |
|
conda create --name ehrshot_env python=3.10 -y |
|
conda activate ehrshot_env |
|
|
|
# Install requirements |
|
pip install torch==2.1.1 femr==0.2.0 datasets==2.15.0 flash_attn==2.3.6 transformers==4.35.2 |
|
``` |
|
|
|
Second, use the following Python script to run inference on a single patient:
|
```python |
|
import datetime

import torch

import femr.models.processor

import femr.models.tokenizer

import femr.models.transformer
|
|
|
model_name = "StanfordShahLab/clmbr-t-base" |
|
|
|
# Load tokenizer / batch loader |
|
tokenizer = femr.models.tokenizer.FEMRTokenizer.from_pretrained(model_name) |
|
batch_processor = femr.models.processor.FEMRBatchProcessor(tokenizer) |
|
|
|
# Load model |
|
model = femr.models.transformer.FEMRModel.from_pretrained(model_name) |
|
|
|
# Create an example patient to run inference on |
|
example_patient = { |
|
'patient_id': 30, |
|
'events': [{ |
|
'time': datetime.datetime(2011, 5, 8), |
|
'measurements': [ |
|
{'code': 'SNOMED/1'}, |
|
], |
|
}, |
|
{ |
|
'time': datetime.datetime(2012, 6, 9), |
|
'measurements': [ |
|
{'code': 'SNOMED/30'}, |
|
{'code': 'SNOMED/103'} |
|
], |
|
}] |
|
} |
|
batch = batch_processor.convert_patient(example_patient, tensor_type="pt") |
|
|
|
# Run model |
|
with torch.no_grad(): |
|
patient_ids, times, reprs = model(batch) |
|
print(patient_ids) |
|
print(times) |
|
print(reprs) |
|
``` |
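
The `reprs` tensor above holds one representation per timepoint in the patient's timeline, aligned with `patient_ids` and `times`. A common downstream pattern, in line with the Uses section above, is to take the representation at each prediction time and fit a lightweight classifier on top. The sketch below is illustrative only and is not the EHRSHOT evaluation pipeline; `patient_reprs` and `labels` are hypothetical stand-ins for a cohort you would assemble yourself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins: in practice, collect one CLMBR representation per
# patient at the prediction time (hidden dimension 768) plus a task label.
rng = np.random.default_rng(0)
patient_reprs = rng.normal(size=(100, 768))
labels = rng.integers(0, 2, size=100)

# Fit a simple logistic-regression probe on top of the frozen representations.
clf = LogisticRegression(max_iter=1000)
clf.fit(patient_reprs, labels)
predicted_risk = clf.predict_proba(patient_reprs)[:, 1]
```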
|
|
|
## Training Details |
|
|
|
Full training details are provided in our accompanying paper, [EHRSHOT (Wornow et al. 2023)](https://arxiv.org/abs/2307.02028). |
|
|
|
### Training Data |
|
|
|
The model is trained on 2.57 million patients from the [Stanford Medicine Research Data Repository (STARR)](https://academic.oup.com/jamiaopen/article/6/3/ooad054/7236015), which contains EHR data from both Stanford Health Care (primarily adult care) |
|
and Lucile Packard Children’s Hospital (primarily pediatric care). |
|
The dataset contains only structured data (i.e. no clinical text or images) and covers demographics (e.g. age, sex, race), diagnoses, procedures, laboratory results, medication prescriptions, and other coded clinical observations. |
|
The data is formatted according to the [Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM)](https://ohdsi.github.io/CommonDataModel/cdm53.html). |
|
All data that we work with is deidentified. |
|
|
|
### Training Procedure |
|
|
|
We train our model using an autoregressive next code prediction objective, i.e. predict the next code in a patient's timeline given their previous codes. |
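
Concretely, this is teacher-forced cross-entropy over the code vocabulary: the prediction at position *t* is scored against the code observed at position *t + 1*. A minimal PyTorch sketch of this objective, with hypothetical shapes, is shown below; the real training loop uses FEMR batches and the model's vocabulary, not random tensors.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: a batch of 2 timelines, 10 positions, vocabulary of 8 codes.
# In training, `logits` come from the transformer and `codes` are tokenized timelines.
logits = torch.randn(2, 10, 8)          # (batch, sequence, vocab)
codes = torch.randint(0, 8, (2, 10))    # (batch, sequence)

# Next-code prediction: position t is trained to predict the code at position t + 1.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, 8),   # predictions for positions 0..T-2
    codes[:, 1:].reshape(-1),           # targets are the codes that follow
)
```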
|
|
|
#### Preprocessing |
|
|
|
We use the [FEMR](https://github.com/som-shahlab/femr/tree/main) Python library for data preprocessing. |
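
Continuing from the quick-start snippet above, the same tokenizer and batch processor can turn each preprocessed patient record into model-ready batches. The loop below is a hedged sketch that only reuses the `convert_patient` call demonstrated earlier; a real preprocessing pipeline operates over a full OMOP-CDM extract via FEMR rather than an in-memory list.

```python
import torch

# Hedged sketch, reusing `batch_processor`, `model`, and `example_patient`
# from the quick-start snippet above.
patients = [example_patient]  # in practice, an iterable of patient records

all_reprs = []
with torch.no_grad():
    for patient in patients:
        batch = batch_processor.convert_patient(patient, tensor_type="pt")
        patient_ids, times, reprs = model(batch)
        all_reprs.append(reprs)
```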
|
|
|
#### Training Hyperparameters |
|
|
|
* Learning rate: 0.00001 |
|
* Context window size: 496 |
|
* Internal dropout: 0 |
|
* Layers: 12 |
|
* Hidden dimension: 768 |
|
|
|
## Evaluation |
|
|
|
We evaluate this model on [the EHRSHOT benchmark](https://ehrshot.stanford.edu). |
|
|
|
Information on this benchmark, its tasks, and our results is detailed in [Wornow et al. 2023](https://arxiv.org/pdf/2307.02028.pdf).
|
|
|
## Technical Specifications |
|
|
|
This model uses the CLMBR architecture from [(Steinberg et al. 2021)](https://www.sciencedirect.com/science/article/pii/S1532046420302653). |
|
The training objective is autoregressive next code prediction.
|
Please see [Wornow et al. 2023](https://arxiv.org/pdf/2307.02028.pdf) for more details on the specific model architecture. |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
``` |
|
@inproceedings{wornow2023ehrshot,
|
title={EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models}, |
|
author={Michael Wornow and Rahul Thapa and Ethan Steinberg and Jason Fries and Nigam Shah}, |
|
booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, |
|
year={2023} |
|
} |
|
``` |
|
## Model Card Authors |
|
|
|
Michael Wornow, Ethan Steinberg, Rahul Thapa, Jason Fries, Nigam H. Shah |
|
|
|
## Model Card Contact |
|
|
|
Michael Wornow (mwornow@stanford.edu) |