license: apache-2.0
datasets:
  - HuggingFaceM4/MMBench
language:
  - en
base_model:
  - 01-ai/Yi-1.5-9B-Chat
  - openai/clip-vit-large-patch14-336
pipeline_tag: image-text-to-text
tags:
  - vision-language
  - multimodal

POINTS-Yi-1.5-9B-Chat

Introduction

We are excited to announce the first version of POINTS, which integrates recent advances in vision-language models with new techniques proposed by researchers from WeChat AI.

🏠 GitHub | 📑 Paper

What's new in POINTS?

Key Innovations

  1. Strong Baseline: We integrate recent advances in vision-language models, namely CapFusion, Dual Vision Encoder, and Dynamic High Resolution, into POINTS.

  2. Pre-training Dataset Filtering: We filter the pre-training dataset using perplexity as the metric. This filtering strategy significantly reduces the size of the pre-training dataset while improving the performance of the model.

  3. Model Soup: We apply model soup to models fine-tuned on different visual instruction tuning datasets, which further improves the performance of the model significantly.
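The filtering and model soup ideas above can be illustrated with a minimal, dependency-free sketch. It is not the POINTS implementation: the function names are hypothetical, keeping the lowest-perplexity samples and the `keep_fraction` value are assumptions made for illustration, and the soup shown is the simplest uniform-averaging variant, with parameters represented as plain lists of floats rather than tensors.

```python
import math
from typing import Dict, List, Sequence, Tuple


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of one sample from its per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def filter_by_perplexity(
    samples: List[Tuple[str, Sequence[float]]],
    keep_fraction: float = 0.2,  # illustrative value, not taken from this README
) -> List[str]:
    """Return the ids of the lowest-perplexity samples (assumed selection rule)."""
    ranked = sorted(samples, key=lambda item: perplexity(item[1]))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [sample_id for sample_id, _ in ranked[:n_keep]]


def uniform_soup(state_dicts: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Average parameters element-wise across models with identical keys."""
    if not state_dicts:
        raise ValueError("need at least one model")
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("all models must share the same parameter names")
    n = len(state_dicts)
    return {
        name: [sum(values) / n for values in zip(*(sd[name] for sd in state_dicts))]
        for name in keys
    }
```

In practice the log-probabilities would come from a language model scoring each pre-training sample, and the averaged parameters would be loaded back into a model that shares the architecture of the fine-tuned checkpoints.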

How to use POINTS?

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import CLIPImageProcessor
from PIL import Image
import torch
import requests
from io import BytesIO


# Download the example image and decode it with PIL
image_url = 'https://github.com/user-attachments/assets/83258e94-5d61-48ef-a87f-80dd9d895524'
response = requests.get(image_url)
response.raise_for_status()  # fail early if the download did not succeed
image_data = BytesIO(response.content)
pil_image = Image.open(image_data)
prompt = 'please describe the image in detail'

# Load the tokenizer, the model (custom code from the Hub), and the image processor
model_path = 'WePOINTS/POINTS-Yi-1-5-9B-Chat'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map='cuda').to(torch.bfloat16)
image_processor = CLIPImageProcessor.from_pretrained(model_path)

# Generation settings passed through to model.chat
generation_config = {
    'max_new_tokens': 1024,
    'temperature': 0.0,
    'top_p': 0.0,
    'num_beams': 1,
}

# Run a single-turn chat on the image and print the response
res = model.chat(
    pil_image,
    prompt,
    tokenizer,
    image_processor,
    True,
    generation_config
)
print(res)

Evaluation

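| Benchmark          | InternVL2-8B | LLaVA-OneVision | POINTS |
|--------------------|--------------|-----------------|--------|
| MMBench-dev-en     | -            | 80.8            | 82.4   |
| MathVista          | 58.3         | 62.3            | 63.0   |
| HallucinationBench | 45.0         | 31.6            | 47.8   |
| OCRBench           | 79.4         | 62.2            | 71.9   |
| AI2D               | 83.6         | 82.4            | 78.8   |
| MMVet              | 54.3         | 51.9            | 49.2   |
| MMStar             | 61.5         | 61.9            | 56.9   |
| MMMU               | 51.2         | 47.9            | 47.6   |
| ScienceQA          | 97.1         | 95.4            | 92.9   |
| MME                | 2215.1       | 1993.6          | 2024.8 |
| RealWorldQA        | 64.2         | 69.9            | 66.3   |
| LLaVA-Wild         | 73.3         | 81.0            | 69.3   |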
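
For readers who want to compare the models programmatically, the following sketch copies the scores above into a dictionary and reports the top-scoring model per benchmark. The helper names are hypothetical, `None` stands in for the score that is not reported, and the comparison assumes that a higher score is better on every benchmark listed.

```python
from collections import Counter
from typing import Dict, Optional

Row = Dict[str, Optional[float]]

# Scores copied from the evaluation results; None marks an unreported value.
SCORES: Dict[str, Row] = {
    "MMBench-dev-en": {"InternVL2-8B": None, "LLaVA-OneVision": 80.8, "POINTS": 82.4},
    "MathVista": {"InternVL2-8B": 58.3, "LLaVA-OneVision": 62.3, "POINTS": 63.0},
    "HallucinationBench": {"InternVL2-8B": 45.0, "LLaVA-OneVision": 31.6, "POINTS": 47.8},
    "OCRBench": {"InternVL2-8B": 79.4, "LLaVA-OneVision": 62.2, "POINTS": 71.9},
    "AI2D": {"InternVL2-8B": 83.6, "LLaVA-OneVision": 82.4, "POINTS": 78.8},
    "MMVet": {"InternVL2-8B": 54.3, "LLaVA-OneVision": 51.9, "POINTS": 49.2},
    "MMStar": {"InternVL2-8B": 61.5, "LLaVA-OneVision": 61.9, "POINTS": 56.9},
    "MMMU": {"InternVL2-8B": 51.2, "LLaVA-OneVision": 47.9, "POINTS": 47.6},
    "ScienceQA": {"InternVL2-8B": 97.1, "LLaVA-OneVision": 95.4, "POINTS": 92.9},
    "MME": {"InternVL2-8B": 2215.1, "LLaVA-OneVision": 1993.6, "POINTS": 2024.8},
    "RealWorldQA": {"InternVL2-8B": 64.2, "LLaVA-OneVision": 69.9, "POINTS": 66.3},
    "LLaVA-Wild": {"InternVL2-8B": 73.3, "LLaVA-OneVision": 81.0, "POINTS": 69.3},
}


def best_model_per_benchmark(scores: Dict[str, Row]) -> Dict[str, str]:
    """Map each benchmark to the model with the highest reported score."""
    best = {}
    for benchmark, row in scores.items():
        reported = {model: value for model, value in row.items() if value is not None}
        best[benchmark] = max(reported, key=reported.get)
    return best


def win_counts(scores: Dict[str, Row]) -> Counter:
    """Count how many benchmarks each model leads."""
    return Counter(best_model_per_benchmark(scores).values())


if __name__ == "__main__":
    for benchmark, model in best_model_per_benchmark(SCORES).items():
        print(f"{benchmark}: {model}")
```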

Citation

If you find our work helpful, feel free to cite us:

@article{liu2024points,
  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2409.04828},
  year={2024}
}

@article{liu2024rethinking,
  title={Rethinking Overlooked Aspects in Vision-Language Models},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2405.11850},
  year={2024}
}