metadata
datasets:
- lmms-lab/LLaVA-OneVision-Data
language:
- en
- zh
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
LLaVA-OneVision
Play with the model on the LLaVA OneVision Chat.
Table of Contents
Model Summary
The LLaVA-OneVision models are 0.5/7/72B parameter models trained on LLaVA-OneVision, based on Qwen2 language model with a context window of 32K tokens.
The model with -chat
postfix denotes the model with iterative DPO training with human preference and suitable for chat usage. Our research shows we can improve the model's chat ability meanwhile maintain other instruction-following abilities through our iterative DPO training recipe.
- Repository: LLaVA-VL/LLaVA-NeXT
- Project Website: llava-onevision.lmms-lab.com
- Paper: LLaVA-OneVision
- Point of Contact: Bo Li
- Languages: English, Chinese
Use
Intended use
The model was trained on LLaVA-OneVision Dataset and have the ability to interact with images, multi-image and videos.
Feel free to share your generations in the Community tab!
Generation
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-si"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.eval()
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(
input_ids,
images=image_tensor,
image_sizes=image_sizes,
do_sample=False,
temperature=0,
max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
Training
Model
- Architecture: SO400M + Qwen2
- Pretraining Stage: LCS-558K, 1 epoch, projector
- Mid Stage: A mixture of 4.7M high-quality synthetic data, 1 epoch, full model
- Final-Image Stage: A mixture of 3.6M single-image data, 1 epoch, full model
- OneVision Stage: A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
- Precision: bfloat16
Hardware & Software
- GPUs: 256 * Nvidia Tesla A100 (for whole model series training)
- Orchestration: Huggingface Trainer
- Neural networks: PyTorch
Citation
@article{li2024llavaonevision,
title={LLaVA-OneVision},
}