|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- Qwen/Qwen2.5-3B-Instruct |
|
- google/siglip-so400m-patch14-384 |
|
tags: |
|
- multimodal |
|
- llava |
|
language: |
|
- en |
|
- zh |
|
pipeline_tag: visual-question-answering |
|
library_name: transformers |
|
--- |
|
|
|
![logo.jpg](logo.jpg) |
|
|
|
<code>Ivy-VL</code> is a lightweight multimodal model with only 3B parameters. |
|
|
|
It accepts both image and text inputs to generate text outputs. |
|
|
|
Thanks to its lightweight design, it can be deployed on edge devices such as AI glasses and smartphones, offering low memory usage and high speed while maintaining strong performance on multimodal tasks. Some well-known small models include [PaliGemma 3B](https://huggingface.co/google/paligemma-3b-mix-448), [Moondream2](https://huggingface.co/vikhyatk/moondream2), [Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B), [InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B), and [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B). Ivy-VL outperforms them on multiple benchmarks. |
|
|
|
# Model Summary: |
|
|
|
* Developed: AI Safeguard, CMU, Standford |
|
|
|
* Model type: Multi-modal model (image+text) |
|
|
|
* Language: Engligh and Chinese |
|
|
|
* License: Apache 2.0 |
|
|
|
* Architecture: Based on LLaVA-One-Vision |
|
|
|
* LLM: Qwen/Qwen2.5-3B-Instruct |
|
|
|
* Vision Encoder: google/siglip-so400m-patch14-384 |
|
|
|
* Notebook demo: [Ivy-VL-demo.ipynb](https://colab.research.google.com/drive/1D5_8sDRcP1HKlWtlqTH7s64xG8OH9NH0?usp=sharing) |
|
|
|
# Evaluation: |
|
|
|
![evaluation.jpg](evaluation.jpg) |
|
|
|
Most of the performance data comes from the VLMEvalKit leaderboard or the original papers. We conducted evaluations using VLMEvalKit. Due to differences in environments and the LLMs used for evaluation, there may be slight variations in performance. |
|
|
|
# How to use: |
|
|
|
|
|
```python |
|
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git |
|
from llava.model.builder import load_pretrained_model |
|
from llava.mm_utils import process_images, tokenizer_image_token |
|
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN |
|
from llava.conversation import conv_templates |
|
from PIL import Image |
|
import requests |
|
import copy |
|
import torch |
|
import warnings |
|
|
|
warnings.filterwarnings("ignore") |
|
|
|
pretrained = "AI-Safeguard/Ivy-VL-llava" |
|
|
|
model_name = "llava_qwen" |
|
device = "cuda" |
|
device_map = "auto" |
|
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args |
|
|
|
model.eval() |
|
|
|
# load image from url |
|
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true" |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
|
|
# load image from local environment |
|
# url = "./local_image.jpg" |
|
# image = Image.open(url) |
|
|
|
image_tensor = process_images([image], image_processor, model.config) |
|
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor] |
|
|
|
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models |
|
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?" |
|
conv = copy.deepcopy(conv_templates[conv_template]) |
|
conv.append_message(conv.roles[0], question) |
|
conv.append_message(conv.roles[1], None) |
|
prompt_question = conv.get_prompt() |
|
|
|
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device) |
|
image_sizes = [image.size] |
|
|
|
cont = model.generate( |
|
input_ids, |
|
images=image_tensor, |
|
image_sizes=image_sizes, |
|
do_sample=False, |
|
temperature=0, |
|
max_new_tokens=4096, |
|
) |
|
|
|
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True) |
|
|
|
print(text_outputs) |
|
``` |
|
|
|
# Future Plan: |
|
|
|
* We plan to release more versions of LLMs in different sizes. |
|
|
|
* We will focus on improving the performance of the video modality. |
|
|
|
# Contact: |
|
Feel free to contact us if you have any questions or suggestions📧: |
|
* Email (Ivy Zhang): ivy.zhang@ai-safeguard.org |
|
|
|
# Citation: |
|
|
|
If you find our work helpful, please consider citing our Model: |
|
```plaintext |
|
@misc{ivy2024ivy-vl, |
|
title={Ivy-VL:Compact Vision-Language Models Achieving SOTA with Optimal Data}, |
|
url={https://huggingface.co/AI-Safeguard/Ivy-VL-llava}, |
|
author={Ivy Zhang,Wei Peng,Jenny N,Theresa Yu and David Qiu}, |
|
month={December}, |
|
year={2024} |
|
} |
|
``` |