WePOINTS/POINTS-Yi-1-5-9B-Chat

POINTS-Yi-1.5-9B-Chat

Introduction

We are excited to announce the first version of POINTS, which integrates recent advances in vision-language models with new techniques proposed by researchers from WeChat AI.

🏠 GitHub | 📑 Paper

What's new in POINTS?

Key Innovations

Strong Baseline: We integrate recent advances in vision-language models, namely CapFusion, Dual Vision Encoder, and Dynamic High Resolution, into POINTS.
Pre-training Dataset Filtering: We filter the pre-training dataset using perplexity as the metric. This filtering strategy significantly reduces the size of the pre-training dataset while improving the performance of the model.
Model Soup: We apply model soup to models fine-tuned on different visual instruction tuning datasets, which further improves the performance of the model significantly.
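To make the last two ideas concrete, here is a minimal sketch in plain Python. It is not the POINTS implementation: the function names, the `keep_fraction` value, the choice to keep the lowest-perplexity samples, and the use of uniform weight averaging as the soup variant are all assumptions made for illustration. Checkpoints are represented as plain dictionaries of float lists so the example runs without a deep learning framework.

```python
import math
from typing import Dict, List, Sequence, Tuple


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean per-token log-likelihood), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def filter_by_perplexity(
    samples: List[Tuple[str, Sequence[float]]],
    keep_fraction: float = 0.2,  # illustrative value, not taken from this card
) -> List[str]:
    """Return the ids of the lowest-perplexity samples.

    Each sample is (sample_id, token_log_probs), where the log-probabilities
    are assumed to come from some reference language model.
    """
    ranked = sorted(samples, key=lambda s: perplexity(s[1]))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [sample_id for sample_id, _ in ranked[:n_keep]]


def uniform_soup(state_dicts: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Average parameters element-wise across checkpoints (uniform model soup).

    All checkpoints are assumed to share the same parameter names and shapes.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    soup = {}
    for name in state_dicts[0]:
        # zip(*...) groups the i-th value of this parameter from every checkpoint.
        columns = zip(*(sd[name] for sd in state_dicts))
        soup[name] = [sum(col) / len(state_dicts) for col in columns]
    return soup


if __name__ == "__main__":
    # Toy data: each sample has three tokens with identical probability.
    data = [(name, [math.log(p)] * 3)
            for name, p in [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.2), ("e", 0.7)]]
    print(filter_by_perplexity(data, keep_fraction=0.4))
    print(uniform_soup([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```

With real checkpoints, the same averaging would be applied to the tensors of each fine-tuned model's state dict, and the per-token log-probabilities would be computed by the language model used for scoring.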

How to use POINTS?

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import CLIPImageProcessor
from PIL import Image
import torch
import requests
from io import BytesIO


# Download the example image and open it with PIL.
image_url = 'https://github.com/user-attachments/assets/83258e94-5d61-48ef-a87f-80dd9d895524'
response = requests.get(image_url)
response.raise_for_status()  # fail early if the download did not succeed
image_data = BytesIO(response.content)
pil_image = Image.open(image_data)
prompt = 'please describe the image in detail'

# Load the tokenizer, the model (which ships custom code, hence
# trust_remote_code=True), and the image processor.
model_path = 'WePOINTS/POINTS-Yi-1-5-9B-Chat'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map='cuda').to(torch.bfloat16)
image_processor = CLIPImageProcessor.from_pretrained(model_path)

# Generation settings passed through to model.chat.
generation_config = {
    'max_new_tokens': 1024,
    'temperature': 0.0,
    'top_p': 0.0,
    'num_beams': 1,
}

# model.chat is provided by the model's remote code.
res = model.chat(
    pil_image,
    prompt,
    tokenizer,
    image_processor,
    True,
    generation_config
)
print(res)

Evaluation

Benchmark	InternVL2-8B	LLaVA-OneVision	POINTS
MMBench-dev-en	-	80.8	82.4
MathVista	58.3	62.3	63.0
HallucinationBench	45.0	31.6	47.8
OCRBench	79.4	62.2	71.9
AI2D	83.6	82.4	78.8
MMVet	54.3	51.9	49.2
MMStar	61.5	61.9	56.9
MMMU	51.2	47.9	47.6
ScienceQA	97.1	95.4	92.9
MME	2215.1	1993.6	2024.8
RealWorldQA	64.2	69.9	66.3
LLaVA-Wild	73.3	81.0	69.3

Citation

If you find our work helpful, feel free to cite us:

@article{liu2024points,
  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2409.04828},
  year={2024}
}

@article{liu2024rethinking,
  title={Rethinking Overlooked Aspects in Vision-Language Models},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2405.11850},
  year={2024}
}
