---
license: cc-by-nc-4.0
library_name: clmbr
tags:
- healthcare
- femr
- medical
extra_gated_prompt: "You agree to all terms outlined in 'The EHRSHOT Credentialed Health Data License' (see https://shahlab.stanford.edu/ehrshot_license). Access requires a verified CITI training certificate using the same process outlined by PhysioNet (see https://physionet.org/about/citi-course/). Please complete the 'Data or Specimens Only Research' course and please provide proof via the verification URL, which takes the form https://www.citiprogram.org/verify/?XXXXXX. You agree to not use the model to conduct experiments that cause harm to human subjects."
extra_gated_fields:
  Full Name: text
  Email: text
  Affiliation: text
  CITI Certification Verification URL: text
  I agree to all terms outlined in 'The EHRSHOT Credentialed Health Data License': checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
---

# CLMBR-T-Base

This is a 141-million-parameter autoregressive foundation model pretrained on 2.57 million deidentified EHRs from Stanford Medicine.

This is the model from [(Wornow et al. 2023)](https://arxiv.org/abs/2307.02028), and is based on the CLMBR architecture originally described in [(Steinberg et al. 2021)](https://www.sciencedirect.com/science/article/pii/S1532046420302653).

As input, this model expects a sequence of coded medical events that have been mapped to Standard Concepts within the [OMOP-CDM vocabulary](https://ohdsi.github.io/CommonDataModel/index.html). The model generates representations of patients which can then be used for downstream prediction tasks.

Input patients should be provided in the [MEDS](https://github.com/Medical-Event-Data-Standard/) schema.

## Model Details

### Model Description

- **Developed by:** Shah lab @ Stanford University
- **Funded by:** Stanford Healthcare
- **Shared by:** Shah lab @ Stanford University
- **Model type:** CLMBR [(Steinberg et al. 2021)](https://www.sciencedirect.com/science/article/pii/S1532046420302653)
- **Language(s) (NLP):** Electronic health record codes
- **License:** CC BY-NC 4.0
- **Finetuned from model:** N/A -- trained from scratch

### Model Sources

- **Website:** [https://ehrshot.stanford.edu/](https://ehrshot.stanford.edu/)
- **GitHub:** [https://github.com/som-shahlab/ehrshot-benchmark/](https://github.com/som-shahlab/ehrshot-benchmark/)
- **Paper:** [EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models](https://arxiv.org/abs/2307.02028)

## Uses

This model is intended to generate representations for patients based on the structured data within their electronic health record.
These representations can then be used for downstream tasks such as predicting diagnoses, detecting anomalies, or doing propensity score matching for causal inference.

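
As a rough illustration of using representations for a downstream prediction task, the sketch below fits a small logistic regression probe on fixed vectors. It is a minimal example under stated assumptions, not the evaluation code from the paper: the vectors are random 4-dimensional stand-ins for real 768-dimensional patient representations, the labels are synthetic, and `train_linear_probe` and `predict_proba` are hypothetical helper names.

```python
import math
import random


def predict_proba(weights, bias, features):
    """Return the sigmoid of a linear score, clamped to avoid overflow."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    score = max(-30.0, min(30.0, score))
    return 1.0 / (1.0 + math.exp(-score))


def train_linear_probe(vectors, labels, learning_rate=0.5, epochs=200):
    """Fit binary logistic regression with full-batch gradient descent."""
    dim = len(vectors[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for features, label in zip(vectors, labels):
            error = predict_proba(weights, bias, features) - label
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - learning_rate * g / len(vectors) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(vectors)
    return weights, bias


# Assumption: toy stand-ins for patient representations (4 dims, not 768).
rng = random.Random(0)
vectors, labels = [], []
for label in (0, 1):
    center = 1.0 if label == 1 else -1.0
    for _ in range(20):
        vectors.append([rng.gauss(center, 0.5) for _ in range(4)])
        labels.append(label)

weights, bias = train_linear_probe(vectors, labels)
predictions = [int(predict_proba(weights, bias, v) >= 0.5) for v in vectors]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"training accuracy: {accuracy:.2f}")
```

With real data, the toy vectors would be replaced by the representations the model returns for each patient, and accuracy would be measured on held-out patients rather than the training set.
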

### Direct Use

You will likely want to fine-tune the model for your downstream use case.

### Out-of-Scope Use

This model is for research purposes only. It is not for use in any real-world decision making that impacts patients, providers, or hospital operations.

## Bias, Risks, and Limitations

This model was trained on a corpus of 2.57 million patients from Stanford Medicine.
The model will thus reflect the patterns of how care is delivered at Stanford Medicine, in addition to the racial and socioeconomic makeup of Stanford Medicine's patient base.
This model may not generalize well to other hospitals and demographic mixes.

While this is technically a generative model, we have not tested its generative abilities and thus do not anticipate it being used to generate synthetic EHR records.
We aim to explore its generative abilities in future work.

## How to Get Started with the Model

Use the code below to get started with the model.

First, install the necessary libraries.
```bash
pip install torch==2.1.1 femr==0.2.3 datasets==2.15.0 xformers transformers==4.35.2
```

Second, run the following Python script to run inference on a single patient:
```python
import femr.models.transformer
import torch
import femr.models.tokenizer
import femr.models.processor
import datetime

model_name = "StanfordShahLab/clmbr-t-base"

# Load tokenizer / batch loader
tokenizer = femr.models.tokenizer.FEMRTokenizer.from_pretrained(model_name)
batch_processor = femr.models.processor.FEMRBatchProcessor(tokenizer)

# Load model
model = femr.models.transformer.FEMRModel.from_pretrained(model_name)

# Create an example patient to run inference on
# This patient follows the MEDS schema: https://github.com/Medical-Event-Data-Standard
example_patient = {
    'patient_id': 30,
    'events': [{
        'time': datetime.datetime(2011, 5, 8),
        'measurements': [
            {'code': 'SNOMED/184099003'},
            {'code': 'Visit/IP'},
        ],
    },
    {
        'time': datetime.datetime(2012, 6, 9),
        'measurements': [
            {'code': 'Visit/OP'},
            {'code': 'SNOMED/3950001'}
        ],
    }]
}

# Convert the patient into tensors, then collate into a batch of size one
raw_batch = batch_processor.convert_patient(example_patient, tensor_type="pt")
batch = batch_processor.collate([raw_batch])

# Run model (no gradients are needed for inference)
with torch.no_grad():
    _, result = model(**batch)
    print(result['timestamps'].cpu().numpy().astype('datetime64[s]'))
    print(result['patient_ids'])
    print(result['representations'])
```
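
Source data often arrives as flat `(time, code)` rows rather than the nested layout used for `example_patient` above. The sketch below shows one way to group such rows into that layout. `build_patient` is a hypothetical helper, not part of FEMR or MEDS, and it only produces the fields that appear in the example above rather than the full MEDS schema.

```python
import datetime
from itertools import groupby


def build_patient(patient_id, coded_events):
    """Group (time, code) pairs into the nested patient layout shown above."""
    # sorted() is stable, so codes sharing a timestamp keep their input order.
    ordered = sorted(coded_events, key=lambda pair: pair[0])
    events = []
    for time, group in groupby(ordered, key=lambda pair: pair[0]):
        events.append({
            'time': time,
            'measurements': [{'code': code} for _, code in group],
        })
    return {'patient_id': patient_id, 'events': events}


rows = [
    (datetime.datetime(2012, 6, 9), 'Visit/OP'),
    (datetime.datetime(2011, 5, 8), 'SNOMED/184099003'),
    (datetime.datetime(2011, 5, 8), 'Visit/IP'),
    (datetime.datetime(2012, 6, 9), 'SNOMED/3950001'),
]
patient = build_patient(30, rows)
print(len(patient['events']))
```

The resulting dictionary has the same shape as `example_patient`, with events ordered by time.
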

## Training Details

Full training details are provided in our accompanying paper, [EHRSHOT (Wornow et al. 2023)](https://arxiv.org/abs/2307.02028).

### Training Data

The model is trained on 2.57 million patients from the [Stanford Medicine Research Data Repository (STARR)](https://academic.oup.com/jamiaopen/article/6/3/ooad054/7236015), which contains EHR data from both Stanford Health Care (primarily adult care) and Lucile Packard Children’s Hospital (primarily pediatric care).
The dataset contains only structured data (i.e. no clinical text or images) and covers demographics (e.g. age, sex, race), diagnoses, procedures, laboratory results, medication prescriptions, and other coded clinical observations.
The data is formatted according to the [Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM)](https://ohdsi.github.io/CommonDataModel/cdm53.html).
All data that we work with is deidentified.

### Training Procedure

We train our model using an autoregressive next code prediction objective, i.e. the model predicts the next code in a patient's timeline given their previous codes.

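
To make the objective concrete, the sketch below turns one toy timeline into (context, target) training pairs. It is only an illustration: the codes are made up for the example, `next_code_pairs` and `negative_log_likelihood` are hypothetical names, and the negative log-likelihood shown is the standard loss for this kind of objective, assumed here rather than confirmed by this card.

```python
import math


def next_code_pairs(codes):
    """Return (context, target) pairs: each code is predicted from its prefix."""
    return [(codes[:i], codes[i]) for i in range(1, len(codes))]


def negative_log_likelihood(predicted, target):
    """Loss for one step, given a dict mapping each code to its probability."""
    return -math.log(predicted[target])


timeline = ['Visit/IP', 'SNOMED/184099003', 'Visit/OP', 'SNOMED/3950001']
pairs = next_code_pairs(timeline)
for context, target in pairs:
    print(context, '->', target)

# Assumption: a made-up predicted distribution for the first step.
toy_prediction = {'SNOMED/184099003': 0.5, 'Visit/OP': 0.25, 'SNOMED/3950001': 0.25}
loss = negative_log_likelihood(toy_prediction, pairs[0][1])
```

A timeline of `n` codes yields `n - 1` pairs, and the loss falls as the model assigns more probability to the code that actually comes next.
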

#### Preprocessing

We use the [FEMR](https://github.com/som-shahlab/femr/tree/main) Python library for data preprocessing.

#### Training Hyperparameters

* Learning rate: 0.00001
* Context window size: 496
* Internal dropout: 0
* Layers: 12
* Hidden dimension: 768

## Evaluation

We evaluate this model on [the EHRSHOT benchmark](https://ehrshot.stanford.edu).

The benchmark, its tasks, and the results are detailed in [Wornow et al. 2023](https://arxiv.org/pdf/2307.02028.pdf).

## Technical Specifications

This model uses the CLMBR architecture from [(Steinberg et al. 2021)](https://www.sciencedirect.com/science/article/pii/S1532046420302653).
The objective is an autoregressive next token prediction task.
Please see [Wornow et al. 2023](https://arxiv.org/pdf/2307.02028.pdf) for more details on the specific model architecture.

## Vocabulary

CLMBR is a language model and requires defining a token vocabulary `V`. However, unlike natural languages, the vocabulary of a structured EHR language model is defined by *medical codes*. Here tokens map to standardized concepts in medical ontologies. Since the union of all tokens from all ontologies, `V_all`, results in a prohibitively large vocabulary, we derive `~V` by filtering to the top `k` most frequent codes as follows:

1. **Knowledge Graphs (G):** A set of `n` medical ontologies (knowledge graphs), `G = {G_1, G_2, ..., G_n}`, defined by [Athena's OMOP Vocabulary List](https://athena.ohdsi.org/vocabulary/list).
2. **Medical Codes as Tokens:** Each knowledge graph `G_i` has a set of unique medical codes `M_i`. The union of all these codes serves as the set of tokens in our complete vocabulary `V_all = M_1 ∪ M_2 ∪ ... ∪ M_n`. Our final, filtered vocabulary is then `~V = sort_freq(V_all)[1:k]` where frequency is calculated over our [STARR EHR OMOP](https://academic.oup.com/jamiaopen/article/6/3/ooad054/7236015) dataset.

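
The filtering step `~V = sort_freq(V_all)[1:k]` can be sketched with a frequency counter. This is a minimal illustration, not the FEMR implementation: `build_vocabulary` is a hypothetical helper, the toy timelines are made up, and the `PREFIX/code` token format is assumed from the example codes used elsewhere in this card.

```python
from collections import Counter


def build_vocabulary(patient_timelines, k):
    """Keep the k most frequent codes across all patient timelines."""
    counts = Counter(code for timeline in patient_timelines for code in timeline)
    return [code for code, _ in counts.most_common(k)]


# Assumption: three toy timelines standing in for the training corpus.
timelines = [
    ['SNOMED/184099003', 'Visit/IP', 'LOINC/31790-9'],
    ['Visit/IP', 'LOINC/31790-9', 'RxNorm/372375'],
    ['Visit/IP', 'CPT4/36818'],
]
vocab = build_vocabulary(timelines, k=2)
token_ids = {code: index for index, code in enumerate(vocab)}
print(vocab)
```

Codes outside the top `k` get no token ID in this sketch. How ties in frequency are broken and how out-of-vocabulary codes are handled are not specified here, so any real pipeline would need to settle both.
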

**CLMBR Vocabulary Summary**

- 21 Source Ontologies/Knowledge Graphs
- 65,536 tokens (2^16, the number of distinct values a `uint16_t` can represent)

| PREFIX | SOURCE | SIZE | EXAMPLE TOKENS |
|:---------------------|:-------------------------------------------------------------------------------------------------|---------:|:---------------------------------------------------|
| LOINC | Logical Observation Identifiers Names and Codes (Regenstrief Institute) | 37,590 | 31790-9, 20449-5 |
| SNOMED | Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) | 18,174 | 105013009, 200755008 |
| RxNorm | RxNorm (NLM) | 4,678 | 2375327, 372375 |
| CPT4 | Current Procedural Terminology version 4 (AMA) | 3,730 | 00790, 36818 |
| RxNorm Extension | OMOP RxNorm Extension | 255 | OMOP358911, OMOP2153393 |
| ICD10PCS | ICD-10 Procedure Coding System (CMS) | 233 | 10907ZC, 4A0234Z |
| ICD9Proc | International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS) | 196 | 68.29, 03.93 |
| Cancer Modifier | Diagnostic Modifiers of Cancer (OMOP) | 88 | c-8th\_AJCC/UICC-Stage-2C, p-7th\_AJCC/UICC-Stage-3B |
| HCPCS | Healthcare Common Procedure Coding System (CMS) | 54 | C1878, P7001 |
| ICDO3 | International Classification of Diseases for Oncology, Third Edition (WHO) | 52 | NULL-C34.8, C56.9 |
| CVX | CDC Vaccine Administered CVX (NCIRD) | 41 | 151, 158 |
| Domain | OMOP | 27 | OMOP generated |
| Race | Race and Ethnicity Code Set (USBC) | 5 | 5, 4 |
| OMOP Extension | OMOP Extension (OHDSI) | 3 | OMOP5160861, OMOP4912978 |
| Gender | OMOP Gender | 2 | F, M |
| Ethnicity | OMOP Ethnicity | 2 | Not Hispanic, Hispanic |
| CMS Place of Service | Place of Service Codes for Professional Claims (CMS) | 2 | OMOP4822036, 02 |
| Medicare Specialty | Medicare provider/supplier specialty codes (CMS) | 1 | A0 |
| Condition Type | OMOP | 1 | OMOP4822053 |
| CARE_SITE | STANFORD_CUSTOM | 396 | 7930934, 7929373 |
| Visit | STANFORD_CUSTOM | 6 | ERIP, ER |
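
As a quick consistency check, the per-ontology sizes in the table can be summed and compared with the stated totals. The snippet below copies the SIZE column verbatim; the dictionary name `vocabulary_sizes` is just for illustration.

```python
# SIZE column of the vocabulary table, keyed by PREFIX.
vocabulary_sizes = {
    'LOINC': 37590,
    'SNOMED': 18174,
    'RxNorm': 4678,
    'CPT4': 3730,
    'RxNorm Extension': 255,
    'ICD10PCS': 233,
    'ICD9Proc': 196,
    'Cancer Modifier': 88,
    'HCPCS': 54,
    'ICDO3': 52,
    'CVX': 41,
    'Domain': 27,
    'Race': 5,
    'OMOP Extension': 3,
    'Gender': 2,
    'Ethnicity': 2,
    'CMS Place of Service': 2,
    'Medicare Specialty': 1,
    'Condition Type': 1,
    'CARE_SITE': 396,
    'Visit': 6,
}

source_count = len(vocabulary_sizes)
total_tokens = sum(vocabulary_sizes.values())
print(source_count, total_tokens, total_tokens == 2 ** 16)
```

The rows account for all 21 sources and sum to 65,536 tokens, so zero-based token IDs would run from 0 to 65,535.
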

## Citation

**BibTeX:**
```
@inproceedings{wornow2023ehrshot,
  title={EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models},
  author={Michael Wornow and Rahul Thapa and Ethan Steinberg and Jason Fries and Nigam Shah},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023}
}
```

## Model Card Authors

Michael Wornow, Ethan Steinberg, Rahul Thapa, Jason Fries, Nigam H. Shah

## Model Card Contact

Michael Wornow (mwornow@stanford.edu)