---
license: other
license_name: yi-license
license_link: LICENSE
---
<div align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_dark.svg" width="200px">
<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_light.svg" width="200px">
<img alt="specify theme context for images" src="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_light.svg" width="200px">
</picture>
</div>
<div align="center">
<h2 align="center">Yi Vision Language Model</h2>
</div>
<div align="center">
<h3 align="center">Better Bilingual Multimodal Model</h3>
</div>
<p align="center">
🤗 <a href="https://huggingface.co/01-ai" target="_blank">Hugging Face</a> • 🤖 <a href="https://www.modelscope.cn/organization/01ai/" target="_blank">ModelScope</a> • ✡️ <a href="https://wisemodel.cn/organization/01.AI" target="_blank">WiseModel</a>
</p>
<p align="center">
👩‍🚀 Ask questions or discuss ideas on <a href="https://github.com/01-ai/Yi/discussions" target="_blank"> GitHub </a>!
</p>
<p align="center">
👋 Join us 💬 <a href="https://github.com/01-ai/Yi/issues/43#issuecomment-1827285245" target="_blank"> WeChat (Chinese) </a>!
</p>
<p align="center">
📚 Grow at <a href="https://github.com/01-ai/Yi/blob/main/docs/learning_hub.md"> Yi Learning Hub </a>!
</p>
<hr>
<!-- DO NOT REMOVE ME -->
<details open>
<summary><b>📕 Table of Contents</b></summary>
- [What is Yi-VL?](#what-is-yi-vl)
  - [Overview](#overview)
  - [Models](#models)
  - [Features](#features)
  - [Architecture](#architecture)
  - [Training](#training)
  - [Limitations](#limitations)
  - [Citation](#citation)
- [Why Yi-VL?](#why-yi-vl)
  - [Benchmarks](#benchmarks)
- [How to use Yi-VL?](#how-to-use-yi-vl)
  - [Quick start](#quick-start)
- [Acknowledgements and attributions](#acknowledgements-and-attributions)
</details>
<hr>
# What is Yi-VL?
## Overview
- The **Yi Visual Language (Yi-VL)** model is the open-source, multimodal version of the Yi **Large Language Model (LLM)** series, enabling content comprehension, recognition, and multi-round conversations about images.
- Yi-VL demonstrates exceptional performance, **ranking first** among all existing open-source models in the latest benchmarks including [MMMU](https://mmmu-benchmark.github.io/#leaderboard) in English and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard) in Chinese (based on data available up to January 2024).
- Yi-VL-34B is the **first** open-source 34B vision language model worldwide.
<div align="right"> [ <a href="#yi-vision-language-model">Back to top ⬆️ </a> ] </div>
## Models
The following Yi-VL versions have been released.
| Model | Download |
|---|---|
| Yi-VL-34B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-34B) |
| Yi-VL-6B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-6B) |
<div align="right"> [ <a href="#yi-vision-language-model">Back to top ⬆️ </a> ] </div>
## Features
Yi-VL offers the following features:
- Multi-round text-image conversations: Yi-VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.
- Bilingual text support: Yi-VL supports conversations in both English and Chinese, including text recognition in images.
- Strong image comprehension: Yi-VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.
- Fine-grained image resolution: Yi-VL supports image understanding at a higher resolution of 448×448.
<div align="right"> [ <a href="#yi-vision-language-model">Back to top ⬆️ </a> ] </div>
## Architecture
Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of three primary components:
- Vision Transformer (ViT): it's initialized with [CLIP ViT-H/14 model](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and used for image encoding.
- Projection Module: it's designed to align image features with the text feature space, and consists of a two-layer Multilayer Perceptron (MLP) with layer normalizations.
- Large Language Model (LLM): it's initialized with [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) or [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat), demonstrating exceptional proficiency in understanding and generating both English and Chinese. To enhance the performance of Yi-VL models in bilingual multimodal understanding and generation, a rich dataset of bilingual image-text pairs is leveraged.
![Yi-VL architecture]()
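
As a rough illustration of how much visual input the ViT passes on to the projection module, the sketch below counts image patches for a ViT with 14×14 patches (the "/14" in ViT-H/14) at the two resolutions mentioned in this document, 224×224 and 448×448. It assumes non-overlapping square patches and ignores any class token; whether Yi-VL keeps or drops such a token is not stated here. The function name `num_patches` is illustrative only.

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


for size in (224, 448):
    print(f"{size}x{size}: {num_patches(size)} patches")
```

Under these assumptions, doubling the side length quadruples the patch count, which is one way to see why the higher resolution exposes finer visual detail to the language model.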
<div align="right"> [ <a href="#yi-vision-language-model">Back to top ⬆️ </a> ] </div>
## Training
Yi-VL is trained to align visual information with the semantic space of the Yi LLM through a comprehensive three-stage training process:
- Stage 1: The parameters of ViT and the projection module are trained using an image resolution of 224×224. The LLM weights are frozen. The training leverages an image caption dataset comprising 100 million image-text pairs. The primary objective is to enhance the ViT's knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.
- Stage 2: The image resolution of ViT is scaled up to 448×448, and the parameters of ViT and the projection module are trained. It aims to further boost the model's capability for discerning intricate visual details. The dataset used in this stage includes about 25 million image-text pairs.
- Stage 3: The parameters of the entire model (that is, ViT, projection module, and LLM) are trained. The primary goal is to enhance the model's proficiency in multimodal chat interactions, thereby endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. To this end, the training dataset encompasses a diverse range of sources, totalling approximately 1 million image-text pairs and covering image captioning, VQA, grounding, and other tasks. To keep the data balanced, we cap the contribution of any single source at 50,000 pairs.
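
The per-source cap described for Stage 3 can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `cap_per_source`, the `(source, sample)` record layout, and the use of random subsampling are all assumptions. The document only states that no single source contributes more than 50,000 pairs.

```python
import random
from collections import defaultdict


def cap_per_source(records, max_per_source=50_000, seed=0):
    """Keep at most `max_per_source` samples from each source.

    `records` is an iterable of (source_name, sample) pairs (hypothetical layout).
    Sources above the cap are randomly subsampled; smaller ones are kept whole.
    """
    by_source = defaultdict(list)
    for source, sample in records:
        by_source[source].append(sample)

    rng = random.Random(seed)
    capped = []
    for source, samples in by_source.items():
        if len(samples) > max_per_source:
            samples = rng.sample(samples, max_per_source)
        capped.extend((source, s) for s in samples)
    return capped


# Toy example with a cap of 3 instead of 50,000.
toy = [("caption", i) for i in range(10)] + [("vqa", i) for i in range(2)]
balanced = cap_per_source(toy, max_per_source=3)
print(len(balanced))
```

In the toy run, the oversized "caption" source is cut down to the cap while the small "vqa" source is kept in full.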
In Stages 1 and 2, the global batch size, learning rate, gradient clipping threshold, and number of epochs are set to 4096, 1e-4, 0.5, and 1, respectively. In Stage 3, these parameters are adjusted to 256, 2e-5, 1.0, and 2. Training uses 128 NVIDIA A100 GPUs. The total training time was approximately 10 days for Yi-VL-34B and 3 days for Yi-VL-6B.
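
The hyperparameters above can be collected in a small structure for reference. The sketch below is illustrative only: the `StageConfig` class and its field names are hypothetical, and the step counts are back-of-the-envelope estimates derived from the approximate dataset sizes quoted in this section, not reported values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    trainable: tuple        # modules whose parameters are updated
    num_pairs: int          # approximate number of image-text pairs
    global_batch_size: int
    learning_rate: float
    grad_clip: float
    epochs: int

    def approx_steps(self) -> int:
        """Rough optimizer-step estimate: ceil(pairs / batch) per epoch."""
        return math.ceil(self.num_pairs / self.global_batch_size) * self.epochs


STAGES = [
    StageConfig("stage1", ("vit", "projection"), 100_000_000, 4096, 1e-4, 0.5, 1),
    StageConfig("stage2", ("vit", "projection"), 25_000_000, 4096, 1e-4, 0.5, 1),
    StageConfig("stage3", ("vit", "projection", "llm"), 1_000_000, 256, 2e-5, 1.0, 2),
]

for stage in STAGES:
    print(stage.name, stage.approx_steps())
```

Because the dataset sizes are only given approximately, the printed step counts should be read as orders of magnitude rather than exact figures.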
<div align="right"> [ <a href="#yi-vision-language-model">Back to top ⬆️ </a> ] </div>
## Limitations
This is the initial release of Yi-VL, which comes with some known limitations. It is recommended to carefully evaluate potential risks before adopting any models.
- Feature limitation
  - Visual question answering is supported. Other features like text-to-3D and image-to-video are not yet supported.
  - Only a single image, not multiple images, can be accepted as input.
- Hallucination problem
  - The model may generate content that does not exist in the image.
  - In scenes containing multiple objects, some objects might be incorrectly identified or described with insufficient detail.
- Resolution issue
  - Yi-VL is trained on images with a resolution of 448×448. During inference, inputs of any resolution are resized to 448×448. Low-resolution images may result in information loss, and higher-resolution images (above 448×448) do not provide additional information.
- Other limitations of the Yi LLM.
## Citation
If you find our work helpful, please feel free to cite us.
```
@article{tbd,
year={2024}
}
```
<div align="right"> [ <a href="#yi-vision-language-model">Back to top ⬆️ </a> ] </div>
# Why Yi-VL?
## Benchmarks
Yi-VL outperforms all existing open-source models in [MMMU](https://mmmu-benchmark.github.io/#leaderboard) and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard), two advanced benchmarks that include massive multi-discipline multimodal questions.
![Yi-VL benchmark]()
<div align="right"> [ <a href="#yi-vision-language-model">Back to top ⬆️ </a> ] </div>
# How to use Yi-VL?
## Quick start
You can perform inference using the code from [LLaVA](https://github.com/haotian-liu/LLaVA). For detailed steps, see [simple startup for pretraining](https://github.com/haotian-liu/LLaVA/pull/966).
Notes:
- You need to modify the system prompt as follows.
```
This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。
### Human: <image_placeholder>
What is it in the image?
### Assistant:
```
- You need to set the parameter `mm_vision_tower` in `config.json` to the local ViT path.
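
The two notes above can be sketched in plain Python. This is only an illustration under assumptions: `build_prompt` and `set_vision_tower` are hypothetical helper names, not part of LLaVA or Yi. The prompt layout simply mirrors the template shown above, and the config edit assumes `config.json` is a JSON object with a top-level `mm_vision_tower` key.

```python
import json
from pathlib import Path

# System prompt copied from the template above (English followed by Chinese).
SYSTEM_PROMPT = (
    "This is a chat between an inquisitive human and an AI assistant. "
    "Assume the role of the AI assistant. Read all the images carefully, "
    "and respond to the human's questions with informative, helpful, "
    "detailed and polite answers. "
    "这是一个好奇的人类和一个人工智能助手之间的对话。"
    "假设你扮演这个AI助手的角色。"
    "仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。"
)


def build_prompt(question: str, image_token: str = "<image_placeholder>") -> str:
    """Assemble a single-turn prompt in the layout shown above."""
    return (
        f"{SYSTEM_PROMPT}\n"
        f"### Human: {image_token}\n"
        f"{question}\n"
        "### Assistant:"
    )


def set_vision_tower(config_path: str, vit_path: str) -> None:
    """Point `mm_vision_tower` in config.json at a local ViT directory."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["mm_vision_tower"] = vit_path
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )


print(build_prompt("What is it in the image?"))
```

A call such as `set_vision_tower("Yi-VL-6B/config.json", "/path/to/local/vit")` would then update the config in place; both paths here are placeholders.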
# Acknowledgements and attributions
This project makes use of open-source software/components. We acknowledge and are grateful to these developers for their contributions to the open-source community.
## List of used open-source projects
1. LLaVA
   - Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yuheng Li, and Yong Jae Lee
   - Source: https://github.com/haotian-liu/LLaVA
   - License: Apache-2.0 license
   - Description: The codebase is based on LLaVA code.
2. OpenCLIP
   - Authors: Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt
   - Source: https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
   - License: MIT
   - Description: The ViT is initialized using the weights of OpenCLIP.
## License
This project is licensed under the [yi-license](https://github.com/01-ai/Yi/blob/main/LICENSE). For more information on the license for this project, please see the LICENSE file in this repository.
## Notes
- This attribution does not claim to cover all open-source components used. Please check individual components and their respective licenses for full details.
- The use of the open-source components is subject to the terms and conditions of the respective licenses.
We appreciate the open-source community for their invaluable contributions to the technology world.