anonymitaet
committed on
update readme
README.md
CHANGED
</div>

<div align="center">
<h1 align="center">Yi Vision Language Model</h1>
</div>

- [Architecture](#architecture)
- [Training](#training)
- [Limitations](#limitations)
- [Why Yi-VL?](#why-yi-vl)
- [Benchmarks](#benchmarks)
- [Showcases](#showcases)
- [How to use Yi-VL?](#how-to-use-yi-vl)
- [Quick start](#quick-start)
- [Misc.](#misc)
- [Citation](#citation)
- [Acknowledgements and attributions](#acknowledgements-and-attributions)

</details>

- Yi-34B-VL is the **first** open-source 34B vision language model worldwide.

## Models

Yi-VL has released the following versions.

Model | Download
---|---
Yi-VL-34B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-34B)
Yi-VL-6B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-6B)

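If you prefer to script the download of the checkpoints listed above, here is a minimal sketch using the `huggingface_hub` client; the local directory is a placeholder of your choosing, not a path shipped with the model.

```python
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

# repo_id comes from the table above; local_dir is a placeholder.
local_path = snapshot_download(
    repo_id="01-ai/Yi-VL-6B",   # or "01-ai/Yi-VL-34B"
    local_dir="./Yi-VL-6B",
)
print(f"Checkpoint downloaded to {local_path}")
```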

## Features

Yi-VL offers the following features:

- Fine-grained image resolution: Yi-VL supports image understanding at a higher resolution of 448×448.

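For illustration only, a minimal sketch of resizing an input image to that resolution with Pillow; the released checkpoints ship their own image processor, which defines the exact resize, crop, and normalization, so treat this as an assumption rather than the model's preprocessing.

```python
# Requires: pip install pillow
from PIL import Image

image = Image.open("example.jpg").convert("RGB")   # placeholder file name
image_448 = image.resize((448, 448))               # match the 448x448 input resolution
image_448.save("example_448.jpg")
```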

## Architecture

Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of three primary components:

- Vision Transformer (ViT): initialized with the [CLIP ViT-H/14 model](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and used for image encoding.

- Projection Module: designed to align image features with the text feature space; it consists of a two-layer Multilayer Perceptron (MLP) with layer normalizations (see the sketch below).

- Large Language Model (LLM): initialized with [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) or [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat), demonstrating exceptional proficiency in understanding and generating both English and Chinese. To enhance the performance of Yi-VL models in bilingual multimodal understanding and generation, a rich dataset of bilingual image-text pairs is leveraged.

![Yi-VL architecture]()

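To make the projection module concrete, here is a minimal PyTorch sketch of a two-layer MLP with layer normalizations that maps ViT features into the LLM embedding space. The class name, dimensions, and activation are assumptions for illustration, not the actual Yi-VL implementation.

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Hypothetical two-layer MLP with layer normalizations, mapping
    ViT image features into the LLM's text feature space."""

    def __init__(self, vision_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        # Dimensions are illustrative; the released checkpoints define their own.
        self.fc1 = nn.Linear(vision_dim, llm_dim)
        self.norm1 = nn.LayerNorm(llm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)
        self.norm2 = nn.LayerNorm(llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) produced by the ViT
        x = self.act(self.norm1(self.fc1(image_features)))
        x = self.norm2(self.fc2(x))
        return x  # (batch, num_patches, llm_dim), aligned with text embeddings
```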

## Training

### Training process

Yi-VL is trained to align visual information well with the semantic space of the Yi LLM through a comprehensive three-stage training process:

- Stage 1: The parameters of the ViT and the projection module are trained using an image resolution of 224×224. The LLM weights are frozen (see the sketch after this list). The training leverages an image caption dataset comprising 100 million image-text pairs. The primary objective is to enhance the ViT's knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.

- Stage 3: The parameters of the entire model (that is, ViT, projection module, and LLM) are trained. The primary goal is to enhance the model's proficiency in multimodal chat interactions, thereby endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. To this end, the training dataset encompasses a diverse range of sources, totaling approximately 1 million image-text pairs, including image captioning, VQA, and grounding data. To ensure data balancing, we impose a cap on the maximum contribution from any single source, restricting it to no more than 50,000 pairs.

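As referenced in the list above, here is a minimal sketch of the stage-wise parameter freezing, assuming the three components are available as separate `nn.Module`s; the function name is hypothetical and not part of the Yi-VL codebase.

```python
import torch.nn as nn

def set_stage_trainability(vit: nn.Module, projection: nn.Module,
                           llm: nn.Module, stage: int) -> None:
    """Stages 1-2 train the ViT and projection module with the LLM frozen;
    stage 3 unfreezes the entire model."""
    for p in vit.parameters():
        p.requires_grad = True          # ViT is trained in every stage
    for p in projection.parameters():
        p.requires_grad = True          # projection module is trained in every stage
    for p in llm.parameters():
        p.requires_grad = (stage == 3)  # LLM weights are only trained in stage 3
```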

Below are the parameters configured for each stage (an illustrative sketch of applying them follows the table).

Stage | Global batch size | Learning rate | Gradient clip | No. of epochs
---|---|---|---|---
Stage 1, 2 | 4096 | 1e-4 | 0.5 | 1
Stage 3 | 256 | 2e-5 | 1.0 | 2

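The values below mirror the table above; how they are wired into the actual training loop (optimizer, scheduler, distributed setup) is not specified in this README, so this is only a sketch of applying the per-stage gradient-clipping threshold.

```python
import torch
import torch.nn as nn

# Per-stage hyperparameters copied from the table above.
STAGE_HPARAMS = {
    "stage_1_2": {"global_batch_size": 4096, "lr": 1e-4, "grad_clip": 0.5, "epochs": 1},
    "stage_3":   {"global_batch_size": 256,  "lr": 2e-5, "grad_clip": 1.0, "epochs": 2},
}

def clip_gradients(model: nn.Module, stage: str) -> None:
    """Clip the gradient norm to the stage's threshold before the optimizer step."""
    max_norm = STAGE_HPARAMS[stage]["grad_clip"]
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
```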

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/EGVHSWG4kAcX01xDaoeXS.png)

### Training resource consumption

- The training uses 128 NVIDIA A100 GPUs.

- The total training time amounted to approximately 10 days for Yi-VL-34B and 3 days for Yi-VL-6B.

## Limitations

- Other limitations of the Yi LLM.

# Why Yi-VL?

## Benchmarks

Yi-VL outperforms all existing open-source models in [MMMU](https://mmmu-benchmark.github.io/#leaderboard) and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard), two advanced benchmarks that include massive multi-discipline multimodal questions.

- MMMU

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/6YuSakMCg3D2AozixdoZ0.png)

- CMMMU

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/kCmXuwLbLvequ93kjh3mg.png)

## Showcases

Yi-VL can describe images accurately and in detail with few hallucinations.

Below are some representative examples of detailed description and visual question answering, showcasing the capabilities of Yi-VL.

- English

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/iD83s2d8X-x6Acodp-SeO.png)

- Chinese

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/l_tLzugFtHk1dkVsFJE7B.png)

# How to use Yi-VL?

- You need to set the parameter `mm_vision_tower` in `config.json` to the local ViT path.

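For example, here is a minimal sketch of setting that parameter programmatically; both paths below are placeholders for your local copies, not values shipped with the model.

```python
import json
from pathlib import Path

config_path = Path("/path/to/Yi-VL-6B/config.json")   # placeholder
local_vit_path = "/path/to/local/vit"                  # placeholder

config = json.loads(config_path.read_text())
config["mm_vision_tower"] = local_vit_path   # parameter named in the note above
config_path.write_text(json.dumps(config, indent=2))
```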

# Misc.

## Citation

If you find our work helpful, please feel free to cite us.

```
@article{tbd,
  year={2024}
}
```

## Acknowledgements and attributions

This project makes use of open-source software/components. We acknowledge and are grateful to these developers for their contributions to the open-source community.

### List of used open-source projects

1. LLaVA
- Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yuheng Li, and Yong Jae Lee

2. OpenClip
- Authors: Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt
- Source: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- License: MIT
- Description: The ViT is initialized using the weights of OpenClip.

### License

This project is licensed under the [yi-license](https://github.com/01-ai/Yi/blob/main/LICENSE). For more information on the license for this project, see the LICENSE file in this repository.

### Notes

- This attribution does not claim to cover all open-source components used. Please check individual components and their respective licenses for full details.
- The use of the open-source components is subject to the terms and conditions of the respective licenses.