---
license: llama2
pipeline_tag: text-to-image
---

# LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

This is the latest version (LaVITv2) of the multi-modal large language model **LaVIT**.

In this version, we further improve LaVIT's image generation capability: the **aesthetics** and **prompt alignment** of the generated images are improved, and the **probability of watermarks** is greatly reduced. The improvements are summarized as follows:

* Use LaVIT to generate better synthetic captions for the noisy Laion-Aesthetic data (as done in DALL-E 3).
* Add high-aesthetic training images from the open-source JourneyDB dataset.
* Use the 20M synthetic Laion-Aesthetic data and 4.2M JourneyDB data to further fine-tune the LLM for 8K steps.

[[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]

## Setup

### Requirements

The code for this repo is tested with PyTorch 1.13.1 and CUDA 11.7.
You should first install and configure the PyTorch environment (including torch and torchvision), and then install the remaining requirements with the following commands:

```shell
git clone https://github.com/jy0205/LaVIT.git
cd LaVIT
pip install -r requirements.txt
```

* (Optional) We recommend using memory-efficient attention by installing xFormers following the instructions [here](https://huggingface.co/docs/diffusers/main/en/optimization/xformers). You can then set the argument `use_xformers=True` in the `build_model` function to save GPU memory and speed up inference, as sketched below.
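
A minimal sketch of what that looks like, reusing the `build_model` arguments from the usage examples later in this README (the surrounding setup is omitted here):

```python
# Sketch: enable memory-efficient attention via xFormers (assumes xFormers is installed
# and that model_path / model_dtype / device_id are defined as in the usage examples below)
model = build_model(model_path=model_path, model_dtype=model_dtype,
                    device_id=device_id, use_xformers=True, understanding=True)
```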

### Model Zoo

We release the LaVIT weights built upon [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) as the large language model.

> Note: Due to the license restrictions of LLaMA-1, we cannot publish its weights. Thus, we release the weights of LaVIT based on Llama 2.

The latest pre-trained weights of LaVIT can be found on Hugging Face [here](https://huggingface.co/rain1011/LaVIT-7B-v2) and take around 25 GB of disk space. We strongly recommend that you download and use the latest version of LaVIT.
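
If you would rather fetch the checkpoint explicitly before running the code (optional; this snippet is not part of the original instructions and simply uses the standard `huggingface_hub` client), a minimal sketch is:

```python
# Sketch: pre-download the LaVIT-7B-v2 checkpoint into a local directory
from huggingface_hub import snapshot_download

snapshot_download(repo_id="rain1011/LaVIT-7B-v2", local_dir="/path/LaVIT_weight")
```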

LaVIT achieves state-of-the-art performance on various multi-modal downstream tasks. The detailed quantitative results are shown below.

#### Zero-shot Multi-modal Understanding

<table>
  <thead align="center">
    <tr>
      <th rowspan="2">Model</th>
      <th colspan="3">Image Captioning</th>
      <th colspan="4">Visual Question Answering</th>
    </tr>
    <tr>
      <th>COCO</th>
      <th>NoCaps</th>
      <th>Flickr30K</th>
      <th>VQAv2</th>
      <th>OK-VQA</th>
      <th>GQA</th>
      <th>VizWiz</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr>
      <td>Flamingo-3B</td>
      <td>73.0</td>
      <td>-</td>
      <td>60.6</td>
      <td>49.2</td>
      <td>41.2</td>
      <td>-</td>
      <td>28.9</td>
    </tr>
    <tr>
      <td>Flamingo-9B</td>
      <td>79.4</td>
      <td>-</td>
      <td>61.5</td>
      <td>51.8</td>
      <td>44.7</td>
      <td>-</td>
      <td>28.8</td>
    </tr>
    <tr>
      <td>OpenFlamingo-9B</td>
      <td>79.5</td>
      <td>-</td>
      <td>59.5</td>
      <td>52.7</td>
      <td>37.8</td>
      <td>-</td>
      <td>27.5</td>
    </tr>
    <tr>
      <td>MetaLM</td>
      <td>82.2</td>
      <td>-</td>
      <td>43.4</td>
      <td>41.1</td>
      <td>11.4</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>Kosmos-1</td>
      <td>84.7</td>
      <td>-</td>
      <td>67.1</td>
      <td>51.0</td>
      <td>-</td>
      <td>-</td>
      <td>29.2</td>
    </tr>
    <tr>
      <td>Kosmos-2</td>
      <td>-</td>
      <td>-</td>
      <td>80.5</td>
      <td>51.1</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>BLIP-2 (Vicuna-7B)</td>
      <td>-</td>
      <td>107.5</td>
      <td>74.9</td>
      <td>-</td>
      <td>-</td>
      <td>41.3</td>
      <td>25.3</td>
    </tr>
    <tr>
      <td>BLIP-2 (Vicuna-13B)</td>
      <td>-</td>
      <td>103.9</td>
      <td>71.6</td>
      <td>-</td>
      <td>-</td>
      <td>32.3</td>
      <td>19.6</td>
    </tr>
    <tr>
      <td>CM3Leon-7B</td>
      <td>61.6</td>
      <td>-</td>
      <td>-</td>
      <td>47.6</td>
      <td>-</td>
      <td>-</td>
      <td>37.6</td>
    </tr>
    <tr>
      <td>Emu (LLaMA-1-13B)</td>
      <td>112.4</td>
      <td>-</td>
      <td>-</td>
      <td>52.0</td>
      <td>38.2</td>
      <td>-</td>
      <td>34.2</td>
    </tr>
    <tr>
      <td>LaVIT (LLaMA-1-7B)</td>
      <td>134.0</td>
      <td><b>114.2</b></td>
      <td>83.0</td>
      <td>66.0</td>
      <td>54.6</td>
      <td>46.8</td>
      <td>38.5</td>
    </tr>
    <tr>
      <td>LaVIT (LLaMA-2-7B)</td>
      <td><b>134.6</b></td>
      <td>113.1</td>
      <td><b>83.2</b></td>
      <td><b>68.2</b></td>
      <td><b>55.7</b></td>
      <td><b>48.0</b></td>
      <td><b>45.3</b></td>
    </tr>
  </tbody>
</table>

#### Zero-shot Text-to-Image Generation

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Model</th>
      <th>Model type</th>
      <th>FID</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr>
      <td rowspan="9">Text2Image Specialist</td>
      <td>DALL-E</td>
      <td>Autoregressive</td>
      <td>28.0</td>
    </tr>
    <tr>
      <td>CogView</td>
      <td>Autoregressive</td>
      <td>27.1</td>
    </tr>
    <tr>
      <td>StableDiffusion</td>
      <td>Diffusion</td>
      <td>12.6</td>
    </tr>
    <tr>
      <td>GLIDE</td>
      <td>Diffusion</td>
      <td>12.2</td>
    </tr>
    <tr>
      <td>DALL-E 2</td>
      <td>Diffusion</td>
      <td>10.4</td>
    </tr>
    <tr>
      <td>Make-A-Scene</td>
      <td>Autoregressive</td>
      <td>11.8</td>
    </tr>
    <tr>
      <td>MUSE-7.6B</td>
      <td>Non-Autoregressive</td>
      <td>7.9</td>
    </tr>
    <tr>
      <td>Imagen-3.4B</td>
      <td>Diffusion</td>
      <td>7.3</td>
    </tr>
    <tr>
      <td>Parti-20B</td>
      <td>Autoregressive</td>
      <td><b>7.2</b></td>
    </tr>
    <tr>
      <td rowspan="5">Multimodal Large Language Model</td>
      <td>GILL (OPT-6.7B)</td>
      <td>LLM</td>
      <td>12.2</td>
    </tr>
    <tr>
      <td>Emu (LLaMA-1-13B)</td>
      <td>LLM</td>
      <td>11.7</td>
    </tr>
    <tr>
      <td>CM3Leon-7B</td>
      <td>LLM</td>
      <td>10.8</td>
    </tr>
    <tr>
      <td>LaVIT (LLaMA-1-7B)</td>
      <td>LLM</td>
      <td>7.4</td>
    </tr>
    <tr>
      <td>LaVIT (LLaMA-2-7B)</td>
      <td>LLM</td>
      <td><b>7.2</b></td>
    </tr>
  </tbody>
</table>

## Usage

LaVIT can serve as a multi-modal generalist that performs both multi-modal comprehension and generation. Below, we provide some examples: only a few lines of code are needed to use **LaVIT** for inference. We also provide detailed examples in the following Jupyter notebooks for learning how to interact with LaVIT:

* `understanding.ipynb`: examples for multi-modal understanding.
* `text2image_synthesis.ipynb`: examples for text-to-image generation.
* `multimodal_synthesis.ipynb`: examples for image synthesis with multi-modal prompts.

### Multi-modal Understanding

```python
import os
import random
import torch
import torch.nn as nn
from models import build_model
from PIL import Image

seed = 1234
random.seed(seed)
torch.manual_seed(seed)

# The local directory where you save the LaVIT pre-trained weights;
# the checkpoint will be downloaded automatically from Hugging Face
model_path = '/path/LaVIT_weight'

# Use BFloat16 during inference
model_dtype = 'bf16'    # Or set to fp16 to enable float16 inference

# Inference using GPU-0
device_id = 0
torch.cuda.set_device(device_id)
device = torch.device('cuda')

# Build LaVIT for understanding and load its weights from Hugging Face
model = build_model(model_path=model_path, model_dtype=model_dtype,
            device_id=device_id, use_xformers=False, understanding=True)
model = model.to(device)

# Image Captioning
image_path = 'demo/caption_image.jpg'
caption = model.generate({"image": image_path})[0]
print(caption)
# an old photo of a horse and buggy in front of a building

# Visual Question Answering
image_path = 'demo/qa_image.jpg'
question = "What's that drink in the glass?"
answer = model.predict_answers({"image": image_path, "text_input": question}, max_len=10)[0]
print("The answer is: ", answer)
# The answer is: orange juice
```
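
As a small usage variation (a sketch, not from the original notebooks), the same `model.generate` call can be applied to several images in a loop:

```python
# Sketch: caption several local images with the understanding model built above
# (the file names here are placeholders; point them at your own images)
image_paths = ['demo/caption_image.jpg', 'demo/qa_image.jpg']
for path in image_paths:
    caption = model.generate({"image": path})[0]
    print(f"{path}: {caption}")
```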

### Text-to-Image Synthesis

For image generation, the classifier-free guidance scale is important: a larger scale encourages the model to generate samples highly related to the input prompt while sacrificing image quality. We set `guidance_scale_for_llm=4.0` by default; you can increase this scale (e.g., to 5.0 or 6.0) to encourage the generated image to follow the semantics of the given prompt more closely. Besides, you can modify `ratio` to generate images with different aspect ratios.

```python
import os
import torch
import random
import torch.nn as nn
from models import build_model
from PIL import Image

seed = 1234
random.seed(seed)
torch.manual_seed(seed)

# The local directory where you save the LaVIT pre-trained weights;
# the checkpoint will be downloaded automatically from Hugging Face
model_path = '/path/LaVIT_weight'

# Use BFloat16 during inference
model_dtype = 'bf16'    # Or set to fp16 to enable float16 inference

# Inference using GPU-0
device_id = 0
torch.cuda.set_device(device_id)
device = torch.device('cuda')
torch_dtype = torch.bfloat16 if model_dtype == "bf16" else torch.float16

# Build LaVIT for generation and load its weights from Hugging Face
# You can set `use_xformers=True` if you have installed xFormers, to save GPU memory and speed up inference
model = build_model(model_path=model_path, model_dtype=model_dtype, device_id=device_id,
            use_xformers=False, understanding=False, load_tokenizer=False)
model = model.to(device)

# Text-to-Image Generation
prompt = "a sculpture of a duck made of wool"

# LaVIT supports 6 different image aspect ratios
ratio_dict = {
    '1:1' : (1024, 1024),
    '4:3' : (896, 1152),
    '3:2' : (832, 1216),
    '16:9' : (768, 1344),
    '2:3' : (1216, 832),
    '3:4' : (1152, 896),
}

# The image aspect ratio you want to generate
ratio = '1:1'
height, width = ratio_dict[ratio]

with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    images = model.generate_image(prompt, width=width, height=height,
        num_return_images=1, guidance_scale_for_llm=4.0, num_inference_steps=25)

images[0].save("output/i2t_output.jpg")
```
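
As a follow-up sketch (reusing the objects defined above, not part of the original example), you can raise the guidance scale and request several candidates at a widescreen aspect ratio:

```python
# Sketch: stronger prompt adherence and multiple candidates at a 16:9 aspect ratio
height, width = ratio_dict['16:9']
with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    candidates = model.generate_image(prompt, width=width, height=height,
        num_return_images=4, guidance_scale_for_llm=6.0, num_inference_steps=25)

for idx, image in enumerate(candidates):
    image.save(f"output/t2i_candidate_{idx}.jpg")   # output paths are illustrative
```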

## Evaluation

The batch evaluation code with multiple GPUs on the adopted multi-modal benchmarks will be released in the coming days.

## Acknowledgement

We are grateful to the following awesome projects when implementing LaVIT:

* [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
* [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
* [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
* [BEIT](https://github.com/microsoft/unilm/tree/master/beit2): Masked Image Modeling with Vector-Quantized Visual Tokenizers
* [Diffusers](https://github.com/huggingface/diffusers): State-of-the-art diffusion models for image and audio generation in PyTorch

## <a name="Citing"></a>Citation

Consider giving this repository a star and citing LaVIT in your publications if it helps your research.

```
@article{jin2023unified,
  title={Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization},
  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
  journal={arXiv preprint arXiv:2309.04669},
  year={2023}
}
```