metadata
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- en
tags:
- llava
- phi
license: mit
library_name: transformers
widget:
- text: What animal is it?
src: >-
https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
- text: Where is it?
src: >-
https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
LLaVA-3b
Model details
LLaVA-3b is a model fine-tuned from Dolphin 2.6 Phi in the LLaVA fashion, using the vision tower from SigLIP 400M. It differs from the original LLaVA architecture in a few ways:
- Multiple image tokens. The multimodal projector generates embeddings of shape [5, 2560] instead of [1, 2560] for images. The idea is that using more tokens allows us to get more info from the image into the language model.
- The model uses the output of the last layer of the vision encoder instead of an intermediate one.
- The context length during training was 1200 tokens, as the L4 GPUs I used didn't allow me to get more.
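To make the numbers above concrete, here is a rough budget check. It is only a sketch under two assumptions that the notes above do not confirm: that every image occupies 5 positions in the sequence (one per projector embedding) and that those positions count against the 1200-token training context. The function name and constants are made up for illustration.

```python
# Assumptions for illustration only: each image expands to 5 embedding
# positions (the projector output shape [5, 2560]), and those positions
# count against the 1200-token context used during training.
IMAGE_TOKENS_PER_IMAGE = 5
TRAINING_CONTEXT_LENGTH = 1200


def remaining_text_budget(num_images: int, num_text_tokens: int) -> int:
    """Return how many positions are left within the training context.

    A negative result means the sequence is longer than anything the
    model saw during training.
    """
    if num_images < 0 or num_text_tokens < 0:
        raise ValueError("counts must be non-negative")
    used = num_images * IMAGE_TOKENS_PER_IMAGE + num_text_tokens
    return TRAINING_CONTEXT_LENGTH - used
```

The 1200 figure describes training, not a hard limit at inference time, so a negative budget is a warning sign rather than an error. Tokens you plan to generate should be subtracted as well.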
Like Dolphin 2.6 Phi, LLaVA-3b uses the ChatML prompt format:
<|im_start|>system
You are Dolphin, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
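If you build prompts in code, a small helper keeps the template consistent. The sketch below is not part of the model's files; `build_chatml_prompt` and its parameters are hypothetical names, and the placement of the `<image>` placeholder on its own line before the question follows the usage example in this README.

```python
def build_chatml_prompt(
    user_message: str,
    system_message: str = "You are Dolphin, a helpful AI assistant.",
    with_image: bool = False,
) -> str:
    """Assemble a ChatML prompt that ends with an open assistant turn."""
    # Put the <image> placeholder on its own line before the question,
    # mirroring the layout of the usage example in this README.
    user_content = f"<image>\n{user_message}" if with_image else user_message
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
```

For example, `build_chatml_prompt("Describe the image.", with_image=True)` yields a prompt with the default system message and an image placeholder in the user turn.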
How to use
Install dependencies
!pip install -q open_clip_torch timm einops
Download modeling files
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
Create a model
from modeling_llava import LlavaForConditionalGeneration
import torch
model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float16)
model = model.to("cuda")
Create processors
from transformers import AutoTokenizer
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor
tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)
Set image and text
from PIL import Image
import requests
image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""
Process inputs
inputs = processor(prompt, raw_image, model, return_tensors='pt')
inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
Generate the response
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.5, temperature=1.2, eos_token_id=tokenizer.eos_token_id)
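`generate` returns token IDs, so the result still has to be decoded, for instance with `tokenizer.decode(output[0])`. Whether the decoded string also contains the prompt depends on how the model's `generate` is implemented, which I have not verified here. The helper below is a sketch that handles both cases; `extract_assistant_reply` is a hypothetical name, not something shipped with the model.

```python
def extract_assistant_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded ChatML string."""
    marker = "<|im_start|>assistant"
    # Keep only what follows the last assistant marker; if the marker is
    # absent (the prompt was not echoed back), use the whole string.
    reply = decoded.rsplit(marker, 1)[-1]
    # Cut at the end-of-turn token if the model emitted one.
    reply = reply.split("<|im_end|>", 1)[0]
    return reply.strip()
```

With the objects created above, this would be used as `print(extract_assistant_reply(tokenizer.decode(output[0])))`.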
Benchmarks
- TextVQA - 33.25%
- GQA - 47.15%
- VQAv2 - 63.1%
- VizWiz - 24.03%
Acknowledgments
Thanks to ML Collective for providing credits for computing resources.