|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- HuggingFaceM4/MMBench |
|
language: |
|
- en |
|
base_model: |
|
- 01-ai/Yi-1.5-9B-Chat |
|
- openai/clip-vit-large-patch14-336 |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- vision-language |
|
- multimodal |
|
--- |
|
## POINTS-Yi-1.5-9B-Chat |
|
|
|
### Introduction |
|
|
|
We are excited to announce the first version of POINTS, which integrates recent advances in vision-language models with new techniques proposed by researchers at WeChat AI.
|
|
|
<p align="center"> |
|
<a href="https://github.com/WePOINTS/WePOINTS">GitHub</a> &nbsp;|&nbsp; <a href="https://arxiv.org/abs/2409.04828">Paper</a>
|
</p> |
|
|
|
### What's new in POINTS? |
|
|
|
**Key Innovations** |
|
|
|
1. **Strong Baseline**: We integrate the most recent advances in vision-language modeling, i.e., CapFusion, Dual Vision Encoder, and Dynamic High Resolution, into POINTS.
|
|
|
2. **Pre-training Dataset Filtering**: We propose filtering the pre-training dataset using perplexity as a metric. This strategy significantly reduces the size of the pre-training dataset while improving model performance (a minimal filtering sketch follows this list).
|
|
|
3. **Model Soup**: We apply model soup to models fine-tuned with different visual instruction tuning datasets, which further improves performance significantly (a weight-averaging sketch follows the figure below).
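
The snippet below is a minimal sketch of perplexity-based filtering, not the exact POINTS pipeline: it assumes a small off-the-shelf causal LM (here `gpt2`) scores each caption, and captions whose perplexity exceeds an illustrative threshold are dropped. The scoring model and threshold are assumptions for demonstration, not values from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

scorer_name = 'gpt2'  # hypothetical scoring model, not the one used in the paper
tokenizer = AutoTokenizer.from_pretrained(scorer_name)
scorer = AutoModelForCausalLM.from_pretrained(scorer_name).eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the scoring LM."""
    inputs = tokenizer(text, return_tensors='pt')
    # With labels == input_ids, the model returns the mean cross-entropy loss.
    loss = scorer(**inputs, labels=inputs['input_ids']).loss
    return torch.exp(loss).item()


captions = [
    'A dog running across a grassy field.',
    'asdk qwe 123 zzz buy now click here',
]
threshold = 50.0  # illustrative cutoff, not a value from the paper
kept = [c for c in captions if perplexity(c) < threshold]
print(kept)
```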
|
|
|
<p align="center"> |
|
<img src="https://github.com/user-attachments/assets/6af35008-f501-400a-a870-b66a9bf2baab" width="60%"/> |
|
</p>
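
Below is a minimal sketch of a uniform model soup, assuming several checkpoints fine-tuned from the same base model on different visual instruction tuning datasets. The checkpoint paths are placeholders, and the paper's recipe for selecting and combining checkpoints may differ from this plain uniform average.

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder paths to checkpoints fine-tuned from the same base model
# on different visual instruction tuning datasets.
checkpoint_paths = ['finetune_run_a', 'finetune_run_b', 'finetune_run_c']

state_dicts = [
    AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True).state_dict()
    for path in checkpoint_paths
]

# Uniformly average every floating-point parameter; keep other buffers as-is.
souped = {}
for key, value in state_dicts[0].items():
    if value.is_floating_point():
        souped[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    else:
        souped[key] = value

# Load the averaged weights into one model instance and save the "soup".
soup_model = AutoModelForCausalLM.from_pretrained(checkpoint_paths[0], trust_remote_code=True)
soup_model.load_state_dict(souped)
soup_model.save_pretrained('points-model-soup')
```

Note that the checkpoints must share the same architecture and parameter names, otherwise their tensors cannot be averaged elementwise.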
|
|
|
|
|
### How to use POINTS? |
|
|
|
```python
import torch
import requests
from io import BytesIO
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPImageProcessor

# Download the example image.
image_url = 'https://github.com/user-attachments/assets/83258e94-5d61-48ef-a87f-80dd9d895524'
response = requests.get(image_url)
image_data = BytesIO(response.content)
pil_image = Image.open(image_data)

prompt = 'please describe the image in detail'
model_path = 'WePOINTS/POINTS-Yi-1-5-9B-Chat'

# Load the tokenizer, model, and image processor from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map='cuda').to(torch.bfloat16)
image_processor = CLIPImageProcessor.from_pretrained(model_path)

# Deterministic, greedy-style decoding settings.
generation_config = {
    'max_new_tokens': 1024,
    'temperature': 0.0,
    'top_p': 0.0,
    'num_beams': 1,
}
res = model.chat(
    pil_image,
    prompt,
    tokenizer,
    image_processor,
    True,
    generation_config
)
print(res)
```
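
Note: this snippet assumes `transformers`, `torch`, `Pillow`, and `requests` are installed, and that a CUDA GPU with enough memory to hold the 9B-parameter model in bfloat16 is available.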
|
|
|
### Evaluation |
|
|
|
| Benchmark | InternVL2-8B | LLaVA-OneVision | POINTS |
| :-------: | :----------: | :-------------: | :----: |
| MMBench-dev-en | - | 80.8 | 82.4 |
| MathVista | 58.3 | 62.3 | 63.0 |
| HallucinationBench | 45.0 | 31.6 | 47.8 |
| OCRBench | 79.4 | 62.2 | 71.9 |
| AI2D | 83.6 | 82.4 | 78.8 |
| MMVet | 54.3 | 51.9 | 49.2 |
| MMStar | 61.5 | 61.9 | 56.9 |
| MMMU | 51.2 | 47.9 | 47.6 |
| ScienceQA | 97.1 | 95.4 | 92.9 |
| MME | 2215.1 | 1993.6 | 2024.8 |
| RealWorldQA | 64.2 | 69.9 | 66.3 |
| LLaVA-Wild | 73.3 | 81.0 | 69.3 |
|
|
|
|
|
### Citation |
|
|
|
If you find our work helpful, feel free to cite us: |
|
|
|
```bibtex
@article{liu2024points,
  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2409.04828},
  year={2024}
}

@article{liu2024rethinking,
  title={Rethinking Overlooked Aspects in Vision-Language Models},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2405.11850},
  year={2024}
}
```