	---
	license: apache-2.0
	datasets:
	- HuggingFaceM4/MMBench
	language:
	- en
	base_model:
	- 01-ai/Yi-1.5-9B-Chat
	- openai/clip-vit-large-patch14-336
	pipeline_tag: image-text-to-text
	tags:
	- vision-language
	- multimodal
	---
	## POINTS-Yi-1.5-9B-Chat

	### Introduction

	We are excited to announce the first version of POINTS, which integrates recent advances in vision-language models with new techniques proposed by researchers from WeChat AI.

	<p align="center">
	🏠 <a href="https://github.com/WePOINTS/WePOINTS">GitHub</a>&nbsp;&nbsp; | &nbsp;&nbsp; 📑 <a href="https://arxiv.org/abs/2409.04828">Paper</a>
	</p>

	### What's new in POINTS?

	Key Innovations

	1. Strong Baseline: We integrate recent advances in vision-language models, namely CapFusion, Dual Vision Encoder, and Dynamic High Resolution, into POINTS.

	2. Pre-training Dataset Filtering: We filter the pre-training dataset using perplexity as the selection metric. This filtering significantly reduces the size of the pre-training dataset while improving model performance.

	3. Model Soup: We apply model soup, i.e., weight averaging, to models fine-tuned on different visual instruction tuning datasets, which further improves model performance.
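The perplexity-based filtering in item 2 can be illustrated with a minimal sketch. It assumes that per-token log-probabilities for each sample have already been produced by some language model, and that the samples with the lowest perplexity are the ones retained. The function names, the toy data, and the `keep_fraction` value are hypothetical and are not taken from the POINTS code base.

```python
import math


def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_perplexity(samples, keep_fraction=0.5):
    """Keep the share of samples with the lowest perplexity.

    `samples` is a list of (text, token_logprobs) pairs. Both the pair
    layout and the default `keep_fraction` are illustrative assumptions.
    """
    ranked = sorted(samples, key=lambda sample: perplexity(sample[1]))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [text for text, _ in ranked[:n_keep]]


# Toy log-probabilities (made up for illustration only).
toy_samples = [
    ("a cat sits on a sofa", [-0.2, -0.3, -0.1, -0.4, -0.2]),
    ("xq zzv lorem 8813", [-3.1, -4.0, -2.7, -3.5]),
    ("a red bus on the street", [-0.5, -0.6, -0.4, -0.5, -0.3, -0.6]),
    ("click here buy now !!!", [-1.9, -2.2, -2.5, -1.8, -2.0]),
]

kept = filter_by_perplexity(toy_samples, keep_fraction=0.5)
print(kept)
```

In practice the log-probabilities would come from a forward pass of a language model over each pre-training sample, and the retained fraction would be tuned on downstream benchmarks.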

	<p align="center">
	<img src="https://github.com/user-attachments/assets/6af35008-f501-400a-a870-b66a9bf2baab" width="60%"/>
	</p>
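The model soup idea from item 3 can be sketched as element-wise averaging of checkpoint weights. The text above does not say which soup variant POINTS uses, so uniform averaging is an assumption here. The checkpoints are plain dictionaries of float lists instead of real tensors, and the names `uniform_soup`, `ckpt_a`, and `ckpt_b` are hypothetical.

```python
def uniform_soup(state_dicts):
    """Average parameters element-wise across checkpoints.

    Assumes every checkpoint has the same parameter names and shapes,
    as is the case for models fine-tuned from one shared initialization.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("checkpoints must share parameter names")
    n = len(state_dicts)
    return {
        key: [sum(vals) / n for vals in zip(*(sd[key] for sd in state_dicts))]
        for key in keys
    }


# Two toy "checkpoints", standing in for models fine-tuned on
# different visual instruction tuning datasets.
ckpt_a = {"proj.weight": [1.0, 2.0], "proj.bias": [0.0]}
ckpt_b = {"proj.weight": [3.0, 4.0], "proj.bias": [1.0]}

soup = uniform_soup([ckpt_a, ckpt_b])
print(soup)
```

With real models, the same averaging would be applied to each tensor in the `state_dict` of every fine-tuned checkpoint before loading the result back into a single model.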


	### How to use POINTS?

	```python
	from io import BytesIO

	import requests
	import torch
	from PIL import Image
	from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPImageProcessor

	# Download the example image and open it with PIL.
	image_url = 'https://github.com/user-attachments/assets/83258e94-5d61-48ef-a87f-80dd9d895524'
	response = requests.get(image_url)
	response.raise_for_status()
	pil_image = Image.open(BytesIO(response.content))

	prompt = 'please describe the image in detail'
	model_path = 'WePOINTS/POINTS-Yi-1-5-9B-Chat'

	# trust_remote_code=True loads the custom model code shipped in the repository.
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModelForCausalLM.from_pretrained(
	    model_path, trust_remote_code=True, device_map='cuda').to(torch.bfloat16)
	image_processor = CLIPImageProcessor.from_pretrained(model_path)

	generation_config = {
	    'max_new_tokens': 1024,
	    'temperature': 0.0,
	    'top_p': 0.0,
	    'num_beams': 1,
	}
	# model.chat is defined by the repository's remote code; the positional
	# arguments are passed in the order the model expects.
	res = model.chat(
	    pil_image,
	    prompt,
	    tokenizer,
	    image_processor,
	    True,
	    generation_config
	)
	print(res)
	```

	### Evaluation

	| Benchmark | InternVL2-8B | LLaVA-OneVision | POINTS |
	| :-------: | :----------: | :-------------: | :----: |
	| MMBench-dev-en | - | 80.8 | 82.4 |
	| MathVista | 58.3 | 62.3 | 63.0 |
	| HallucinationBench | 45.0 | 31.6 | 47.8 |
	| OCRBench | 79.4 | 62.2 | 71.9 |
	| AI2D | 83.6 | 82.4 | 78.8 |
	| MMVet | 54.3 | 51.9 | 49.2 |
	| MMStar | 61.5 | 61.9 | 56.9 |
	| MMMU | 51.2 | 47.9 | 47.6 |
	| ScienceQA | 97.1 | 95.4 | 92.9 |
	| MME | 2215.1 | 1993.6 | 2024.8 |
	| RealWorldQA | 64.2 | 69.9 | 66.3 |
	| LLaVA-Wild | 73.3 | 81.0 | 69.3 |
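
To make the table easier to read at a glance, the following sketch tallies which model has the highest reported score on each benchmark. The scores are copied from the table above; the helper name `best_model` is hypothetical, and the tally assumes that a higher score is better on every benchmark listed.

```python
# Scores copied from the evaluation table; None marks a missing entry ("-").
models = ("InternVL2-8B", "LLaVA-OneVision", "POINTS")
scores = {
    "MMBench-dev-en": (None, 80.8, 82.4),
    "MathVista": (58.3, 62.3, 63.0),
    "HallucinationBench": (45.0, 31.6, 47.8),
    "OCRBench": (79.4, 62.2, 71.9),
    "AI2D": (83.6, 82.4, 78.8),
    "MMVet": (54.3, 51.9, 49.2),
    "MMStar": (61.5, 61.9, 56.9),
    "MMMU": (51.2, 47.9, 47.6),
    "ScienceQA": (97.1, 95.4, 92.9),
    "MME": (2215.1, 1993.6, 2024.8),
    "RealWorldQA": (64.2, 69.9, 66.3),
    "LLaVA-Wild": (73.3, 81.0, 69.3),
}


def best_model(row):
    """Return the model with the highest reported score in one table row."""
    reported = [(score, name) for score, name in zip(row, models) if score is not None]
    return max(reported)[1]


# Group benchmarks by the model that leads on them.
wins = {name: [] for name in models}
for benchmark, row in scores.items():
    wins[best_model(row)].append(benchmark)

for name in models:
    print(f"{name}: {len(wins[name])} of {len(scores)} ({', '.join(wins[name])})")
```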


	### Citation

	If you find our work helpful, feel free to cite us:

	```
	@article{liu2024points,
	  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
	  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
	  journal={arXiv preprint arXiv:2409.04828},
	  year={2024}
	}

	@article{liu2024rethinking,
	  title={Rethinking Overlooked Aspects in Vision-Language Models},
	  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
	  journal={arXiv preprint arXiv:2405.11850},
	  year={2024}
	}
	```