---
license: other
datasets:
- MIMIC-CXR
- NIH-CXR
- CheXpert
library_name: diffusers
extra_gated_prompt: >-
  Please confirm that you have read and agree to the following disclaimer.

  The model(s) and/or software described in this repository are provided for research and development use only. The model(s) and/or software are not intended for use in clinical decision-making or for any other clinical use, and performance for clinical use has not been established. You bear sole responsibility for any use of these model(s) and/or software, including incorporation into any product intended for clinical use.
extra_gated_fields:
  I have read and agree to the disclaimer: checkbox
---
|
|
|
# Model card for RadEdit |
|
|
|
## Model description |
|
|
|
[RadEdit](https://link.springer.com/chapter/10.1007/978-3-031-73254-6_21) is a deep learning approach for stress testing biomedical vision models to discover failure cases. It uses a generative text-to-image model to “edit” chest X-rays by using a text description to add or remove abnormalities from a masked region of the image. These edited images can subsequently be used to test whether existing models (e.g. those for disease classification or anatomy segmentation) perform as expected under these different conditions.
|
|
|
![RadEdit Banner](./radedit_banner.jpg) |
|
|
|
To enable this, a text-to-image [latent diffusion](https://arxiv.org/abs/2112.10752) model is trained from scratch to generate chest X-rays from either the impression section of a radiology report (a short clinically actionable outline of the main findings) or a list of radiographic observations. |
|
|
|
RadEdit is described in detail in [RadEdit: stress-testing biomedical vision models via diffusion image editing (F. Pérez-García, S. Bond-Taylor, et al., 2024)](https://link.springer.com/chapter/10.1007/978-3-031-73254-6_21). |
|
|
|
We release the weights for the RadEdit model as well as the editing pipeline for stress-testing models. |
|
|
|
- **Developed by:** Microsoft Health Futures |
|
- **Model type:** [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) |
|
- **License:** Model weights in the [unet subfolder](./unet) are licensed under [MSRLA](./unet/LICENSE). Editing pipeline in [pipeline.py](./pipeline.py) is licensed under MIT. |
|
- **Components:** Text encoder and tokenizer: [BioViL-T](https://huggingface.co/microsoft/BiomedVLP-BioViL-T). Autoencoder: [SDXL-VAE](https://huggingface.co/stabilityai/sdxl-vae). |
|
|
|
|
|
## Contents |
|
|
|
* [Model Uses](#uses)
  * [Intended Use](#intended-use)
  * [Primary Intended Use](#primary-intended-use)
  * [Out-of-Scope Use](#out-of-scope-use)
* [Data](#data)
  * [MIMIC-CXR](#mimic-cxr)
  * [NIH-CXR](#nih-cxr)
  * [CheXpert](#chexpert)
* [Biases, Risks and Limitations](#biases-risks-and-limitations)
* [Model Capabilities](#model-capabilities)
* [Getting Started](#getting-started)
  * [Sampling Chest X-Rays](#sampling-chest-x-rays)
  * [Editing](#editing)
* [Training Details](#training-details)
  * [Environmental Impact](#environmental-impact)
  * [Compute Infrastructure](#compute-infrastructure)
  * [Software](#software)
* [Citation](#citation)
|
|
|
|
|
## Uses |
|
|
|
### Intended Use |
|
|
|
The model checkpoints are intended to be used solely for (I) future research on chest X-ray generation and model stress-testing and (II) reproducibility of the experimental results reported in the reference paper. The code and model checkpoints should not be used to provide medical or clinical opinions, and are not designed to replace the role of qualified medical professionals in appropriately identifying, assessing, diagnosing or managing medical conditions. Users remain responsible for any outputs generated by the model.
|
|
|
### Primary Intended Use |
|
|
|
The primary intended use is to support AI researchers reproducing and building on top of this work. RadEdit and its associated models should be helpful for exploring various biomedical stress-testing tasks via image editing or generation. |
|
|
|
### Out-of-Scope Use |
|
|
|
**Any** deployed use case of the model, commercial or otherwise, is out of scope. Although we evaluated the models using a broad set of publicly-available research benchmarks, the models and evaluations are intended *for research use only* and not intended for deployed use cases. |
|
|
|
## Data |
|
|
|
RadEdit was trained on the following public deidentified chest X-ray datasets. Only frontal-view chest X-rays are used, totalling 487,680 training images. For [MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.0.0/), the impression section of the radiology report (a short clinically actionable outline of the main findings) is used as the input text to the model. For [NIH-CXR](https://openaccess.thecvf.com/content_cvpr_2017/html/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.html) and [CheXpert](https://aimi.stanford.edu/datasets/chexpert-chest-x-rays), a list of all abnormalities present in an image as indicated by the labels, e.g., “Cardiomegaly. Pneumothorax.”, is used as the input text.
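The label-to-text conversion described above for NIH-CXR and CheXpert can be sketched as follows. This is a minimal illustration, not the preprocessing code used for training: the function name `labels_to_prompt`, the dictionary input format, and the `empty_text` fallback for images without positive labels are assumptions made for the example.

```python
from typing import Mapping


def labels_to_prompt(labels: Mapping[str, bool], empty_text: str = "") -> str:
    """Join the positive labels into a prompt such as 'Cardiomegaly. Pneumothorax.'.

    `empty_text` is a hypothetical placeholder returned when no label is positive;
    this card does not state which text was used in that case.
    """
    positives = [name for name, present in labels.items() if present]
    if not positives:
        return empty_text
    # Each abnormality becomes its own short sentence.
    return " ".join(f"{name}." for name in positives)


example = {"Cardiomegaly": True, "Edema": False, "Pneumothorax": True}
print(labels_to_prompt(example))  # Cardiomegaly. Pneumothorax.
```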
|
|
|
### MIMIC-CXR |
|
|
|
The [MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.0.0/) dataset contains 377,110 image-report pairs from 227,827 radiology studies. A patient may have multiple studies, whereas each study may contain multiple chest x-ray (CXR) images taken at different views. We follow the standard partition and use the first nine subsets (P10-P18) for training and validation, while reserving the last (P19) for testing. |
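A minimal sketch of the partition above, assuming that the subset (P10 to P19) is given by the first two digits of the MIMIC-CXR subject ID, as in the `p10` ... `p19` folder layout of the public release. The function name `mimic_split` and the returned split names are hypothetical.

```python
def mimic_split(subject_id: int) -> str:
    """Return 'train_val' for subsets P10-P18 and 'test' for subset P19."""
    prefix = str(subject_id)[:2]  # e.g. 10000032 -> "10", i.e. subset P10
    valid_prefixes = {str(n) for n in range(10, 20)}
    if prefix not in valid_prefixes:
        raise ValueError(f"Unexpected subject ID: {subject_id}")
    return "test" if prefix == "19" else "train_val"


print(mimic_split(10000032))  # train_val
print(mimic_split(19000001))  # test
```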
|
|
|
### NIH-CXR |
|
|
|
The [NIH-CXR](https://openaccess.thecvf.com/content_cvpr_2017/html/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.html) dataset contains 112,120 X-ray images with 8 automatically generated disease labels from 30,805 unique patients. Since there is no official validation split, we create a random train/validation split, ensuring that no patient appears in both sets. |
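A patient-disjoint random split of this kind can be sketched as below. This is an illustration under stated assumptions, not the split used for RadEdit: the function name, the `image_to_patient` mapping, the validation fraction, and the seed are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def patient_level_split(
    image_to_patient: Dict[str, str],
    val_fraction: float = 0.1,  # assumed value; the card does not state the ratio
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split image IDs into train/validation lists that share no patient."""
    # Shuffle patients (not images) so all images of a patient stay together.
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)
    num_val = max(1, round(len(patients) * val_fraction))
    val_patients = set(patients[:num_val])
    train = [img for img, pat in image_to_patient.items() if pat not in val_patients]
    val = [img for img, pat in image_to_patient.items() if pat in val_patients]
    return train, val


# Toy mapping: 20 images from 5 patients, 4 images each
mapping = {f"img_{i:02d}.png": f"patient_{i % 5}" for i in range(20)}
train_ids, val_ids = patient_level_split(mapping, val_fraction=0.2)
print(len(train_ids), len(val_ids))  # 16 4
```

Splitting at the patient level rather than the image level prevents images of the same patient from leaking between the training and validation sets.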
|
|
|
### CheXpert |
|
|
|
The [CheXpert](https://aimi.stanford.edu/datasets/chexpert-chest-x-rays) dataset contains 224,316 chest X-ray images from 65,240 patients together with automatically generated labels indicating the presence of 14 observations in radiology reports. We use the official train/validation split. |
|
|
|
|
|
## Biases, Risks and Limitations |
|
|
|
The model was developed using English corpora and should therefore be considered English-only. The model is evaluated on a narrow set of biomedical benchmark tasks, described in the [RadEdit paper](https://arxiv.org/abs/2312.12865). As such, it is not suitable for use in any clinical setting. Under some conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. The model is likely to carry many of the limitations of the models from which it is derived: [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [BioViL-T](https://huggingface.co/microsoft/BiomedVLP-BioViL-T), and [SDXL-VAE](https://huggingface.co/stabilityai/sdxl-vae). In particular, the SDXL-VAE (which is used to compress images prior to training the diffusion model) can exhibit artefacts in its reconstructions, which can make generated images distinguishable from real images. See Figure 12 in [Taming Transformers for High-Resolution Image Synthesis](https://arxiv.org/abs/2012.09841v3) for examples of such artefacts. While evaluation has included clinical input, it is not exhaustive; model performance will vary across settings, and the model is intended for research use only.
|
|
|
Further, the model inherits the biases of the training datasets. These datasets come from hospitals in the United States; therefore, the model might be biased towards the population represented in the training data. Underlying biases of the training datasets may not be well characterized. A substantial proportion of the training data comes from inpatient medical records; samples from the model are thus reflective of this population. Due to the automated procedure used to obtain pathology labels, erroneous labels may have been used to train the model, which may affect its performance.
|
|
|
The RadEdit editing pipeline is not applicable to all stress-testing scenarios. For example, testing how segmentation models behave under cardiomegaly (enlarged heart) is not possible, as this would require the segmentation masks to be changed. Other limitations of the editing procedure are discussed in the [RadEdit paper](https://arxiv.org/abs/2312.12865).
|
|
|
Other limitations: |
|
|
|
* The model does not achieve perfect photorealism.
|
* Model outputs may include errors. |
|
* The model can fail to produce aligned outputs for more complex prompts. |
|
* The model can fail to produce outputs matching the text input, particularly if the text differs substantially from the training data.
|
* When using the model for image editing, unwanted visual changes may be made. |
|
|
|
|
|
## Getting Started |
|
|
|
This repository provides the weights for the U-Net model. The VAE, text encoder, tokenizer, and scheduler have to be loaded separately and combined into the generation pipeline:
|
|
|
```python
from transformers import AutoModel, AutoTokenizer
from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPipeline, UNet2DConditionModel

# Load the UNet model
unet_loaded = UNet2DConditionModel.from_pretrained("microsoft/radedit", subfolder="unet")

# Load all other components of the stable diffusion pipeline
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
text_encoder = AutoModel.from_pretrained(
    "microsoft/BiomedVLP-BioViL-T",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedVLP-BioViL-T",
    model_max_length=128,
    trust_remote_code=True,
)
scheduler = DDIMScheduler(
    beta_schedule="linear",
    clip_sample=False,
    prediction_type="epsilon",
    timestep_spacing="trailing",
    steps_offset=1,
)

# Combine the components into the generation pipeline
generation_pipeline = StableDiffusionPipeline(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet_loaded,
    scheduler=scheduler,
    safety_checker=None,
    requires_safety_checker=False,
    feature_extractor=None,
)
generation_pipeline.to("cuda")
```
|
|
|
### Sampling Chest X-Rays |
|
|
|
The generation pipeline can be used to sample images as follows:
|
|
|
```python
import torch

prompts = [
    "Small right-sided pleural effusion",
    "No acute cardiopulmonary process",
    "Small left-sided pleural effusion",
    "Large right-sided pleural effusion",
    "Bilateral pleural effusions",
    "Large left-sided pleural effusion",
]

# Fix the random seed so that sampling is reproducible
torch.manual_seed(0)
images = generation_pipeline(
    prompts,
    num_inference_steps=100,
    guidance_scale=7.5,
).images
```
|
|
|
![RadEdit Samples](./radedit_samples.png) |
|
|
|
### Editing |
|
|
|
To load the RadEdit editing pipeline, we convert the generation pipeline into the custom pipeline in [pipeline.py](./pipeline.py):
|
|
|
```python
from diffusers import DiffusionPipeline

# Reuse the components of `generation_pipeline` in the custom editing pipeline
radedit_pipeline = DiffusionPipeline.from_pipe(
    generation_pipeline,
    custom_pipeline="microsoft/radedit",
)
```
|
|
|
Following this, RadEdit can be used to edit an input image (`input_img` below) using two masks: `edit_mask`, which defines the region the editing prompt is applied to, and `keep_mask`, which defines the region where any edits are prevented from taking place.
|
|
|
```python
# `input_img`, `input_mask`, and `fixed_mask` are placeholders for the image
# and the two masks, which must be loaded beforehand.
prompt = "No acute cardiopulmonary process"
arrays = radedit_pipeline(
    prompt,
    weights=[7.5],
    image=input_img,
    edit_mask=input_mask,
    keep_mask=fixed_mask,
    num_inference_steps=200,
    invert_prompt="",
    skip_ratio=0.3,
)
```
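Once edited images are available, the stress test itself amounts to comparing a downstream model's behaviour on the original and the edited images. The sketch below illustrates that comparison; the function names, the choice of accuracy as the metric, and the toy predictions are assumptions made for illustration and do not reproduce the evaluation protocol of the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], targets: Sequence[int]) -> float:
    """Fraction of predictions that match the targets."""
    if len(predictions) != len(targets) or len(targets) == 0:
        raise ValueError("predictions and targets must be non-empty and equally long")
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct / len(targets)


def accuracy_drop(
    preds_original: Sequence[int],
    targets_original: Sequence[int],
    preds_edited: Sequence[int],
    targets_edited: Sequence[int],
) -> float:
    """Accuracy on the original images minus accuracy on the edited images."""
    return accuracy(preds_original, targets_original) - accuracy(preds_edited, targets_edited)


# Hypothetical classifier outputs for four images showing a finding (label 1)
# and for the same images after the finding was edited out (label 0).
drop = accuracy_drop(
    preds_original=[1, 1, 1, 0],
    targets_original=[1, 1, 1, 1],
    preds_edited=[1, 0, 1, 0],
    targets_edited=[0, 0, 0, 0],
)
print(f"{drop:.2f}")  # 0.25
```

A large drop suggests that the downstream model relies on image features outside the edited region, which is the kind of failure the stress test is meant to surface.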
|
|
|
## Training details |
|
|
|
We train the U-Net for 300 epochs, monitoring validation loss to avoid overfitting. During training we regularly evaluate a number of different metrics which assess the quality, diversity and alignment between prompt and generation, including FID, precision/recall/density/coverage, and CLIP score to ensure that samples are high quality and diverse. |
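For reference, the standard definition of FID compares Gaussian fits to feature embeddings of real and generated images. The symbols below are introduced here for illustration ($\mu_r, \Sigma_r$ are the mean and covariance of real-image features, $\mu_g, \Sigma_g$ those of generated-image features); this card does not specify which feature extractor was used for RadEdit.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one.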
|
|
|
### Environmental impact |
|
|
|
- **Hardware type:** NVIDIA V100 GPUs |
|
- **Hours used:** 318 hours/GPU × 1 node × 8 GPUs/node = 2544 GPU-hours
|
- **Cloud provider:** Azure |
|
- **Compute region:** West US 2 |
|
- **Carbon emitted:** 229 kg CO₂ eq. |
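The GPU-hour total above can be checked with a few lines of arithmetic. The per-GPU-hour emission figure computed at the end is derived here from the two reported numbers; it is not a value stated by the authors.

```python
hours_per_gpu = 318
nodes = 1
gpus_per_node = 8
carbon_kg = 229  # reported total, in kg CO2 eq.

gpu_hours = hours_per_gpu * nodes * gpus_per_node
kg_per_gpu_hour = carbon_kg / gpu_hours  # implied average, derived not reported

print(gpu_hours)  # 2544
print(round(kg_per_gpu_hour, 3))  # 0.09
```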
|
|
|
### Compute infrastructure |
|
|
|
RadEdit was trained on [Azure Machine Learning](https://azure.microsoft.com/en-us/products/machine-learning). |
|
|
|
### Software |
|
|
|
We used [SimpleITK](https://simpleitk.org/) and [Pydicom](https://pydicom.github.io/) for processing of DICOM files. |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@inproceedings{perez-garcia_bond-taylor_radedit,
  title     = {{RadEdit}: Stress-Testing Biomedical Vision Models via Diffusion Image Editing},
  author    = {P{\'e}rez-Garc{\'i}a, Fernando and Bond-Taylor, Sam and Sanchez, Pedro P. and van Breugel, Boris and Castro, Daniel C. and Sharma, Harshita and Salvatelli, Valentina and Wetscherek, Maria T. A. and Richardson, Hannah and Lungren, Matthew P. and Nori, Aditya and Alvarez-Valle, Javier and Oktay, Ozan and Ilse, Maximilian},
  year      = 2025,
  booktitle = {Computer Vision -- ECCV 2024},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {358--376},
  isbn      = {978-3-031-73254-6},
  editor    = {Leonardis, Ale{\v{s}} and Ricci, Elisa and Roth, Stefan and Russakovsky, Olga and Sattler, Torsten and Varol, G{\"u}l},
  abstract  = {Biomedical imaging datasets are often small and biased, meaning that real-world performance of predictive models can be substantially lower than expected from internal testing. This work proposes using generative image editing to simulate dataset shifts and diagnose failure modes of biomedical vision models; this can be used in advance of deployment to assess readiness, potentially reducing cost and patient harm. Existing editing methods can produce undesirable changes, with spurious correlations learned due to the co-occurrence of disease and treatment interventions, limiting practical applicability. To address this, we train a text-to-image diffusion model on multiple chest X-ray datasets and introduce a new editing method, RadEdit, that uses multiple image masks, if present, to constrain changes and ensure consistency in the edited images, minimising bias. We consider three types of dataset shifts: acquisition shift, manifestation shift, and population shift, and demonstrate that our approach can diagnose failures and quantify model robustness without additional data collection, complementing more qualitative tools for explainable AI.}
}
```
|
|
|
**APA:** |
|
|
|
> Pérez-García, F., Bond-Taylor, S., Sanchez, P. P., van Breugel, B., Castro, D. C., Sharma, H., … Ilse, M. (2025). RadEdit: Stress-Testing Biomedical Vision Models via Diffusion Image Editing. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Computer Vision – ECCV 2024 (pp. 358–376). Cham: Springer Nature Switzerland.
|
|
|
## Model card contact |
|
|
|
Sam Bond-Taylor ([`sbondtaylor@microsoft.com`](mailto:sbondtaylor@microsoft.com)). |
|
|