---
license: mit
tags:
- generation
- text-to-image
- image-variation
- image-to-text
- image-editing
- vision
datasets:
- Laion2B-en
widget:
- text: "A high tech solarpunk utopia in the Amazon rainforest"
  example_title: Amazon rainforest
---
# Versatile Diffusion V1.0 Model Card
We built **Versatile Diffusion (VD), the first unified multi-flow multimodal diffusion framework**, as a step towards **Universal Generative AI**. Versatile Diffusion can natively support image-to-text, image-variation, text-to-image, and text-variation, and can be further extended to other applications such as semantic-style disentanglement, image-text dual-guided generation, latent image-to-text-to-image editing, and more. Future versions will support more modalities such as speech, music, video and 3D.
Resources for more information: [GitHub](https://github.com/SHI-Labs/Versatile-Diffusion), [arXiv](https://arxiv.org/abs/2211.08332).
# Model Details
A single flow of Versatile Diffusion contains a VAE, a diffuser, and a context encoder, and thus handles one task (e.g., text-to-image) under one data type (e.g., image) and one context type (e.g., text). The multi-flow structure of Versatile Diffusion is shown in the following diagram:
<p align="center">
<img src="https://huggingface.co/shi-labs/versatile-diffusion-model/resolve/main/assets/figures/VD_framework.png" width="99%">
</p>
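As a rough illustration of the multi-flow idea, the sketch below models each flow as a (data type, context type) pair and maps it to one of the four natively supported tasks. The `Flow` class, the `FLOWS` table, and `find_flow` are hypothetical names used only for this illustration; they are not part of the Versatile Diffusion codebase or of Diffusers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One flow: a task defined by what is generated and what conditions it."""
    task: str
    data_type: str     # modality produced by the VAE + diffuser
    context_type: str  # modality consumed by the context encoder


# Hypothetical table of the four natively supported tasks.
FLOWS = [
    Flow("text-to-image", data_type="image", context_type="text"),
    Flow("image-variation", data_type="image", context_type="image"),
    Flow("image-to-text", data_type="text", context_type="image"),
    Flow("text-variation", data_type="text", context_type="text"),
]


def find_flow(data_type: str, context_type: str) -> Flow:
    """Return the flow that generates `data_type` conditioned on `context_type`."""
    for flow in FLOWS:
        if (flow.data_type, flow.context_type) == (data_type, context_type):
            return flow
    raise KeyError(f"no flow for data={data_type!r}, context={context_type!r}")


print(find_flow("image", "text").task)
```

Each combination of data type and context type selects exactly one task, which is why a single flow handles a single task.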
- **Developed by:** Xingqian Xu, Atlas Wang, Eric Zhang, Kai Wang, and Humphrey Shi
- **Model type:** Diffusion-based multimodal generation model
- **Language(s):** English
- **License:** MIT
- **Resources for more information:** [GitHub Repository](https://github.com/SHI-Labs/Versatile-Diffusion), [Paper](https://arxiv.org/abs/2211.08332).
- **Cite as:**
```
@article{xu2022versatile,
title = {Versatile Diffusion: Text, Images and Variations All in One Diffusion Model},
author = {Xingqian Xu and Zhangyang Wang and Eric Zhang and Kai Wang and Humphrey Shi},
year = 2022,
url = {https://arxiv.org/abs/2211.08332},
eprint = {2211.08332},
archiveprefix = {arXiv},
primaryclass = {cs.CV}
}
```
# Usage
You can use the model both with the [🧨Diffusers library](https://github.com/huggingface/diffusers) and the [SHI-Labs Versatile Diffusion codebase](https://github.com/SHI-Labs/Versatile-Diffusion).
## 🧨 Diffusers
Diffusers lets you use either a unified pipeline or more memory-efficient, task-specific pipelines.
**Make sure to install `transformers` from `"main"` in order to use this model:**
```
pip install git+https://github.com/huggingface/transformers
```
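To confirm which versions are installed before loading the model, a minimal check along these lines may help. It uses only the standard library; `installed_version` is a hypothetical helper name, and the package names listed are simply the ones this card mentions.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this model card relies on.
for name in ("transformers", "diffusers", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

A source install from `main` often reports a development version string (for example, one with a `.dev0` suffix), though that is not guaranteed.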
### VersatileDiffusionPipeline
To use Versatile Diffusion for all tasks, it is recommended to use the [`VersatileDiffusionPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/versatile_diffusion#diffusers.VersatileDiffusionPipeline):
```py
#! pip install git+https://github.com/huggingface/transformers diffusers torch
from diffusers import VersatileDiffusionPipeline
import torch
import requests
from io import BytesIO
from PIL import Image
pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
# text prompt
prompt = "a red car"
# download the initial image
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
response.raise_for_status()
init_image = Image.open(BytesIO(response.content)).convert("RGB")
# text to image
text_to_image_result = pipe.text_to_image(prompt).images[0]
# image variation of the downloaded image
variation_result = pipe.image_variation(init_image).images[0]
# dual-guided generation (text prompt + downloaded image)
dual_guided_result = pipe.dual_guided(prompt, init_image).images[0]
```
### Task Specific
The task-specific pipelines load only the weights they need onto the GPU.
You can find all task-specific pipelines [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/versatile_diffusion#versatilediffusion).
You can use them as follows:
#### Text to Image
```py
from diffusers import VersatileDiffusionTextToImagePipeline
import torch
pipe = VersatileDiffusionTextToImagePipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe.remove_unused_weights()
pipe = pipe.to("cuda")
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe("an astronaut riding on a horse on mars", generator=generator).images[0]
image.save("./astronaut.png")
```
#### Image variations
```py
from diffusers import VersatileDiffusionImageVariationPipeline
import torch
import requests
from io import BytesIO
from PIL import Image
# download an initial image
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
pipe = VersatileDiffusionImageVariationPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(image, generator=generator).images[0]
image.save("./car_variation.png")
```
#### Dual-guided generation
```py
from diffusers import VersatileDiffusionDualGuidedPipeline
import torch
import requests
from io import BytesIO
from PIL import Image
# download an initial image
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
text = "a red car in the sun"
pipe = VersatileDiffusionDualGuidedPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe.remove_unused_weights()
pipe = pipe.to("cuda")
generator = torch.Generator(device="cuda").manual_seed(0)
text_to_image_strength = 0.75
image = pipe(prompt=text, image=image, text_to_image_strength=text_to_image_strength, generator=generator).images[0]
image.save("./red_car.png")
```
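To compare how strongly the text prompt steers the result, one option is to sweep `text_to_image_strength` over several values. The sketch below only builds the values and output file names. It assumes the parameter accepts values between 0 and 1, with larger values weighting the text prompt more heavily; that reading is an assumption and should be checked against the Diffusers documentation. `strength_sweep` and `output_name` are hypothetical helpers.

```python
def strength_sweep(steps: int = 5):
    """Return `steps` evenly spaced strengths from 0.0 to 1.0 inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [round(i / (steps - 1), 2) for i in range(steps)]


def output_name(strength: float, stem: str = "red_car") -> str:
    """Build an output file name that records the strength that was used."""
    return f"./{stem}_t2i_{strength:.2f}.png"


for strength in strength_sweep():
    print(strength, output_name(strength))
```

With the dual-guided pipeline loaded as above and the downloaded input image kept in its own variable, each strength can be passed as `text_to_image_strength=strength` and the resulting image saved to `output_name(strength)`.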
### Original GitHub Repository
Follow the instructions [here](https://github.com/SHI-Labs/Versatile-Diffusion/#evaluation).
# Cautions, Biases, and Content Acknowledgment
We would like to raise users' awareness of the potential issues and concerns with this demo. Like previous large foundation models, Versatile Diffusion could be problematic in some cases, partially due to imperfect training data and pretrained networks (VAEs / context encoders) with limited scope. In future research, VD may do better on tasks such as text-to-image and image-to-text with the help of more powerful VAEs, more sophisticated network designs, and cleaner data. So far, we have kept all features available for research testing, both to show the great potential of the VD framework and to collect important feedback to improve the model in the future. We welcome researchers and users to report issues with the HuggingFace community discussion feature or to email the authors.
Beware that VD may output content that reinforces or exacerbates societal biases, as well as realistic faces, pornography, and violence. VD was trained on the LAION-2B dataset, which was scraped from non-curated online images and text, and may still contain unintended exceptions even though we removed illegal content. VD in this demo is meant only for research purposes.