license: apache-2.0
datasets:
- HuggingFaceM4/MMBench
language:
- en
base_model:
- 01-ai/Yi-1.5-9B-Chat
- openai/clip-vit-large-patch14-336
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
POINTS-Yi-1.5-9B-Chat
Introduction
We are excited to announce the first version of POINTS, which integrates recent advances in vision-language models with new techniques proposed by researchers from WeChat AI.
What's new in POINTS?
Key Innovations
Strong Baseline: We integrate recent advances in vision-language models, namely CapFusion, Dual Vision Encoder, and Dynamic High Resolution, into POINTS.
Pre-training Dataset Filtering: We filter the pre-training dataset using perplexity as the metric. This filtering strategy significantly reduces the size of the pre-training dataset while improving the performance of the model.
Model Soup: We apply model soup to models fine-tuned on different visual instruction tuning datasets, which further improves the performance of the model significantly.
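The last two ideas above can be illustrated with a minimal, dependency-free sketch. It is not the POINTS implementation. The function names (`perplexity`, `filter_by_perplexity`, `uniform_soup`) and the `keep_fraction` value are hypothetical. Per-token log-probabilities are assumed to have been computed beforehand by some language model. Uniform weight averaging is assumed as the soup variant, since this card does not say which variant is used.

```python
import math
from typing import Dict, List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_perplexity(
    samples: List[Tuple[str, Sequence[float]]],
    keep_fraction: float = 0.2,  # assumed value, not taken from this card
) -> List[str]:
    """Keep the texts with the lowest perplexity.

    Each sample is (text, token_logprobs), where the log-probabilities
    are natural logs assigned by a language model of your choice.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(samples, key=lambda s: perplexity(s[1]))
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return [text for text, _ in ranked[:n_keep]]


def uniform_soup(state_dicts: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Average parameters element-wise across models with identical layouts.

    Plain lists stand in for weight tensors to keep the sketch self-contained.
    """
    if not state_dicts:
        raise ValueError("need at least one model")
    names = state_dicts[0].keys()
    if any(sd.keys() != names for sd in state_dicts):
        raise ValueError("all models must share the same parameter names")
    n = len(state_dicts)
    return {
        name: [sum(vals) / n for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in names
    }


if __name__ == "__main__":
    data = [
        ("clean caption", [math.log(0.9)] * 3),
        ("noisy caption", [math.log(0.1)] * 3),
        ("average caption", [math.log(0.5)] * 3),
    ]
    print(filter_by_perplexity(data, keep_fraction=0.5))
    print(uniform_soup([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```

With real checkpoints, the same averaging would be applied to each tensor of the fine-tuned models' state dicts, which requires all of them to share one architecture.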
How to use POINTS?
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import CLIPImageProcessor

# Download the example image and open it with PIL.
image_url = 'https://github.com/user-attachments/assets/83258e94-5d61-48ef-a87f-80dd9d895524'
response = requests.get(image_url)
response.raise_for_status()  # fail early if the download did not succeed
image_data = BytesIO(response.content)
pil_image = Image.open(image_data)

prompt = 'please describe the image in detail'
model_path = 'WePOINTS/POINTS-Yi-1-5-9B-Chat'

# trust_remote_code=True is required because the model class ships with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map='cuda').to(torch.bfloat16)
image_processor = CLIPImageProcessor.from_pretrained(model_path)

generation_config = {
    'max_new_tokens': 1024,
    'temperature': 0.0,
    'top_p': 0.0,
    'num_beams': 1,
}

# chat() is defined by the model's remote code; arguments are passed positionally.
res = model.chat(
    pil_image,
    prompt,
    tokenizer,
    image_processor,
    True,  # boolean flag expected by the model's chat() method
    generation_config
)
print(res)
Evaluation
Benchmark | InternVL2-8B | LLaVA-OneVision | POINTS |
---|---|---|---|
MMBench-dev-en | - | 80.8 | 82.4 |
MathVista | 58.3 | 62.3 | 63.0 |
HallusionBench | 45.0 | 31.6 | 47.8 |
OCRBench | 79.4 | 62.2 | 71.9 |
AI2D | 83.6 | 82.4 | 78.8 |
MMVet | 54.3 | 51.9 | 49.2 |
MMStar | 61.5 | 61.9 | 56.9 |
MMMU | 51.2 | 47.9 | 47.6 |
ScienceQA | 97.1 | 95.4 | 92.9 |
MME | 2215.1 | 1993.6 | 2024.8 |
RealWorldQA | 64.2 | 69.9 | 66.3 |
LLaVA-Wild | 73.3 | 81.0 | 69.3 |
Citation
If you find our work helpful, feel free to cite us:
@article{liu2024points,
title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
journal={arXiv preprint arXiv:2409.04828},
year={2024}
}
@article{liu2024rethinking,
title={Rethinking Overlooked Aspects in Vision-Language Models},
author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
journal={arXiv preprint arXiv:2405.11850},
year={2024}
}