BAAI
/

Image-Text-to-Text
Transformers
Safetensors
English
Chinese
multimodal
Inference Endpoints
gsh33's picture
Create README.md
31b0a03 verified
metadata
license: apache-2.0
language:
  - en
  - zh
tags:
  - multimodal
library_name: transformers
datasets:
  - BAAI/Infinity-MM
  - BAAI/Infinity-Instruct
  - BAAI/Infinity-Preference
base_model:
  - Qwen/Qwen2.5-1.5B-Instruct
  - google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text

Introduction

The Aquila-VL-2B model is a vision-language model (VLM) trained with open-sourced dataset Infinity-MM.

This repository is used to release intermediate checkpoints obtained during different stages of training. Please feel free to use these models for analysis and experimentation.

Evaluation

We evaluated the model using the VLMEvalKit tool. Whenever possible, we prioritized using the OpenAI API for test sets that support API-based evaluation.

benchmark 2-a 2-b 2-c 3 4 (final_model)
MMMUval 42.89 42.44 44.78 46.22 47.4
MMStar 45.80 49.33 51.73 53.73 54.9
MMBench_V1.1test 65.41 67.53 68.03 73.40 75.2
MathVistatestmini 48.60 52.40 54.30 60.10 59.0
HallusionBench 37.53 39.65 38.23 40.21 43.0
OCRBench 57.50 58.90 62.50 76.70 77.2
AI2Dtest 64.31 66.74 68.13 75.55 75.0
MMVet 36.24 36.97 39.68 38.35 44.3
Average 49.78 51.75 53.42 58.03 59.51

How to use

# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import requests
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

pretrained = "BAAI/Aquila-VL-2B-llava-qwen"

model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other thing you want to pass in llava_model_args

model.eval()

# load image from url
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# load image from local environment
# url = "./local_image.jpg"
# image = Image.open(url)

image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)

print(text_outputs)

Citation

If you find this useful, please cite the following work

@misc{gu2024infinitymmscalingmultimodalperformance,
      title={Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data}, 
      author={Shuhao Gu and Jialing Zhang and Siyuan Zhou and Kevin Yu and Zhaohu Xing and Liangdong Wang and Zhou Cao and Jintao Jia and Zhuoyi Zhang and Yixuan Wang and Zhenchong Hu and Bo-Wen Zhang and Jijie Li and Dong Liang and Yingli Zhao and Yulong Ao and Yaoqi Liu and Fangxiang Feng and Guang Liu},
      year={2024},
      eprint={2410.18558},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.18558}, 
}