anonymitaet committed
Commit 24cefa3 · verified · 1 Parent(s): ba27432

update readme

Files changed (1)
  1. README.md +58 -33
README.md CHANGED
@@ -15,7 +15,7 @@ license_link: LICENSE
  </div>

  <div align="center">
- <h2 align="center">Yi Vision Language Model</h2>
+ <h1 align="center">Yi Vision Language Model</h1>
  </div>


@@ -53,12 +53,14 @@ license_link: LICENSE
  - [Architecture](#architecture)
  - [Training](#training)
  - [Limitations](#limitations)
- - [Citation](#citation)
  - [Why Yi-VL?](#why-yi-vl)
  - [Benchmarks](#benchmarks)
+ - [Showcases](#showcases)
  - [How to use Yi-VL?](#how-to-use-yi-vl)
  - [Quick start](#quick-start)
- - [Acknowledgements and attributions](#acknowledgements-and-attributions)
+ - [Misc.](#misc)
+ - [Citation](#citation)
+ - [Acknowledgements and attributions](#acknowledgements-and-attributions)

  </details>

@@ -74,8 +76,6 @@ license_link: LICENSE

  - Yi-34B-VL is the **first** open-source 34B vision language model worldwide.

- <div align="right"> [ <a href="#yi-vision-language-model">Back to top ⬆️ </a> ] </div>
-
  ## Models

  Yi-VL has released the following versions.
@@ -85,8 +85,6 @@ Model | Download
  Yi-VL-34B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-34B)
  Yi-VL-6B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-6B)

- <div align="right"> [ <a href="#Yi-Vision-Language-Model">Back to top ⬆️ </a> ] </div>
-
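The checkpoints listed in the Models table above can be pulled to a local directory with `huggingface_hub` before use. This is only an illustrative sketch, not an instruction from this README; the destination path is a placeholder.

```python
# Illustrative download of a Yi-VL checkpoint listed in the Models table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="01-ai/Yi-VL-6B",   # or "01-ai/Yi-VL-34B"
    local_dir="./Yi-VL-6B",     # placeholder destination directory
)
print("Model files downloaded to:", local_dir)
```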
  ## Features

  Yi-VL offers the following features:
@@ -99,24 +97,22 @@ Yi-VL offers the following features:

  - Fine-grained image resolution: Yi-VL supports image understanding at a higher resolution of 448x448.

- <div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>
-
  ## Architecture

  Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of three primary components:

  - Vision Transformer (ViT): it's initialized with [CLIP ViT-H/14 model](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and used for image encoding.

- - Projection Module: it's designed to align image features with text feature spcae, consists of a two-layer Multilayer Perceptron (MLP) with layer normalizations.
+ - Projection Module: it's designed to align image features with text feature space, consisting of a two-layer Multilayer Perceptron (MLP) with layer normalizations.

  - Large Language Model (LLM): it's initialized with [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) or [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat), demonstrating exceptional proficiency in understanding and generating both English and Chinese. To enhance the performance of Yi-VL models in bilingual multimodal understanding and generation, a rich dataset of bilingual image-text pairs is leveraged.

  ![Yi-VL architecture]()

- <div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>
-
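To make the projection module described above concrete, here is a minimal PyTorch sketch of a two-layer MLP with layer normalizations that maps ViT patch features into the LLM embedding space. The class name, hidden sizes, and activation are illustrative assumptions, not the actual Yi-VL implementation.

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Sketch of a two-layer MLP with layer normalizations that aligns
    image features with the text feature space (dimensions are placeholders)."""
    def __init__(self, vit_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.LayerNorm(llm_dim),
            nn.GELU(),                      # assumed activation, not stated in the README
            nn.Linear(llm_dim, llm_dim),
            nn.LayerNorm(llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vit_dim) produced by the ViT encoder
        return self.layers(image_features)

# Example: a 448x448 image split into 14x14 patches yields 32*32 = 1024 tokens.
projected = ProjectionMLP()(torch.randn(1, 1024, 1280))
print(projected.shape)  # torch.Size([1, 1024, 4096])
```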
  ## Training

+ ### Training process
+
  Yi-VL is trained to align visual information well to the semantic space of Yi LLM, which undergoes a comprehensive three-stage training process:

  - Stage 1: The parameters of ViT and the projection module are trained using an image resolution of 224&times;224. The LLM weights are frozen. The training leverages an image caption dataset comprising 100 million image-text pairs. The primary objective is to enhance the ViT's knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.
@@ -125,9 +121,20 @@ Yi-VL is trained to align visual information well to the semantic space of Yi LL

  - Stage 3: The parameters of the entire model (that is, ViT, projection module, and LLM) are trained. The primary goal is to enhance the model's proficiency in multimodal chat interactions, thereby endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. To this end, the training dataset encompasses a diverse range of sources, totalling approximately 1 million image-text pairs, including the data of image caption, VQA, grounding and so on. To ensure data balancing, we impose a cap on the maximum data contribution from any single source, restricting it to no more than 50,000 pairs.
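The Stage 3 data-balancing rule above (at most 50,000 pairs from any single source) can be pictured with the short sketch below; the function name and the per-pair `source` field are hypothetical, not part of the Yi-VL codebase.

```python
import random
from collections import defaultdict

def cap_per_source(pairs, max_per_source=50_000, seed=0):
    """Keep at most `max_per_source` image-text pairs from each data source.
    Assumes each pair is a dict carrying a 'source' key (hypothetical schema)."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for pair in pairs:
        by_source[pair["source"]].append(pair)
    balanced = []
    for items in by_source.values():
        rng.shuffle(items)                    # sample uniformly within a source
        balanced.extend(items[:max_per_source])
    rng.shuffle(balanced)                     # mix the retained sources together
    return balanced

# Example with a toy cap of 2 pairs per source.
toy = [{"source": "vqa", "id": i} for i in range(5)] + [{"source": "caption", "id": i} for i in range(3)]
print(len(cap_per_source(toy, max_per_source=2)))  # 4
```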

- In Stage 1 and 2, the global batch size, the learning rate, the gradient clip and the number of epoch are set to 4096, 1e-4, 0.5 and 1, respectively. In Stage 3, these parameters are adjusted to 256, 2e-5, 1.0 and 2. The training consumes 128 NVIDIA A100 GPUs. The total training time amounted to approximately 10 days for Yi-VL-34B and 3 days for Yi-VL-6B.
+ Below are the parameters configured for each stage.
+
+ Stage | Global batch size | Learning rate | Gradient clip | No. of epochs
+ |---|---|---|---|---
+ Stage 1, 2 | 4096 | 1e-4 | 0.5 | 1
+ Stage 3 | 256 | 2e-5 | 1.0 | 2
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/EGVHSWG4kAcX01xDaoeXS.png)
+
+ ### Training resource consumption
+
+ - The training consumes 128 NVIDIA A100 GPUs.
+
+ - The total training time amounted to approximately 10 days for Yi-VL-34B and 3 days for Yi-VL-6B.

- <div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

  ## Limitations

@@ -151,27 +158,33 @@ This is the initial release of the Yi-VL, which comes with some known limitation

  - Other limitations of the Yi LLM.

- ## Citation

- If you find our work helpful, please feel free to cite us.

- ```
- @article{tbd,
- year={2024}
- }
- ```

- <div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

- # Why Yi-VL?

- ## Benchmarks

- Yi-VL outperforms all existing open-source models in [MMMU](https://mmmu-benchmark.github.io/#leaderboard) and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard), two advanced benchmarks that include massive multi-discipline multimodal questions.

- ![Yi-VL benchmark]()

- <div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>
+ # Why Yi-VL?
+
+ ## Benchmarks
+
+ Yi-VL outperforms all existing open-source models in [MMMU](https://mmmu-benchmark.github.io/#leaderboard) and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard), two advanced benchmarks that include massive multi-discipline multimodal questions.
+
+ - MMMU
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/6YuSakMCg3D2AozixdoZ0.png)
+
+ - CMMMU
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/kCmXuwLbLvequ93kjh3mg.png)
+
+ ## Showcases
+
+ Yi-VL can describe images accurately and in detail with few hallucinations.
+
+ Below are some representative examples of detailed description and visual question answering, showcasing the capabilities of Yi-VL.
+
+ - English
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/iD83s2d8X-x6Acodp-SeO.png)
+
+ - Chinese
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/l_tLzugFtHk1dkVsFJE7B.png)

  # How to use Yi-VL?

@@ -193,11 +206,23 @@ Notes:

  - You need to set the parameter `mm_vision_tower` in `config.json` to the local ViT path (see the sketch at the end of this diff).

- # Acknowledgements and attributions
+ # Misc.
+
+ ## Citation
+
+ If you find our work helpful, please feel free to cite us.
+
+ ```
+ @article{tbd,
+ year={2024}
+ }
+ ```
+
+ ## Acknowledgements and attributions

  This project makes use of open-source software/components. We acknowledge and are grateful to these developers for their contributions to the open-source community.

- ## List of used open-source projects
+ ### List of used open-source projects

  1. LLaVA
  - Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yuheng Li, and Yong Jae Lee
@@ -208,14 +233,14 @@ This project makes use of open-source software/components. We acknowledge and ar
  2. OpenClip
  - Authors: Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt
  - Source: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- - License: mit
+ - License: MIT
  - Description: The ViT is initialized using the weights of OpenClip.

- ## License
+ ### License

- This project is licensed under the [yi-license](https://github.com/01-ai/Yi/blob/main/LICENSE). For more information on the license for this project, please see the LICENSE file in this repository.
+ This project is licensed under the [yi-license](https://github.com/01-ai/Yi/blob/main/LICENSE). For more information on the license for this project, see the LICENSE file in this repository.

- ## Notes
+ ### Notes

  - This attribution does not claim to cover all open-source components used. Please check individual components and their respective licenses for full details.
  - The use of the open-source components is subject to the terms and conditions of the respective licenses.
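As referenced in the quick-start note earlier in this diff, `mm_vision_tower` in the model's `config.json` must point to the locally downloaded ViT weights. A minimal sketch of that edit follows; both paths are placeholders to replace with your own.

```python
import json
from pathlib import Path

# Placeholder paths: adjust to wherever the Yi-VL checkpoint and the CLIP ViT
# weights (see the OpenClip attribution above) were downloaded.
model_dir = Path("./Yi-VL-6B")
local_vit_path = "./clip-vit-h-14"

config_path = model_dir / "config.json"
config = json.loads(config_path.read_text())

# Point mm_vision_tower at the local ViT path, as the quick-start note requires.
config["mm_vision_tower"] = local_vit_path
config_path.write_text(json.dumps(config, indent=2))

print("mm_vision_tower ->", config["mm_vision_tower"])
```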
 