Edit model card

Generic badge

Model

llava-phi-3-mini is a LLaVA model fine-tuned from microsoft/Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.

Note: This model is in HuggingFace LLaVA format.

Resources:

Details

Model Visual Encoder Projector Resolution Pretraining Strategy Fine-tuning Strategy Pretrain Dataset Fine-tune Dataset Pretrain Epoch Fine-tune Epoch
LLaVA-v1.5-7B CLIP-L MLP 336 Frozen LLM, Frozen ViT Full LLM, Frozen ViT LLaVA-PT (558K) LLaVA-Mix (665K) 1 1
LLaVA-Llama-3-8B CLIP-L MLP 336 Frozen LLM, Frozen ViT Full LLM, LoRA ViT LLaVA-PT (558K) LLaVA-Mix (665K) 1 1
LLaVA-Llama-3-8B-v1.1 CLIP-L MLP 336 Frozen LLM, Frozen ViT Full LLM, LoRA ViT ShareGPT4V-PT (1246K) InternVL-SFT (1268K) 1 1
LLaVA-Phi-3-mini CLIP-L MLP 336 Frozen LLM, Frozen ViT Full LLM, Full ViT ShareGPT4V-PT (1246K) InternVL-SFT (1268K) 1 2

Results

Image
Model MMBench Test (EN) MMMU Val SEED-IMG AI2D Test ScienceQA Test HallusionBench aAcc POPE GQA TextVQA MME MMStar
LLaVA-v1.5-7B 66.5 35.3 60.5 54.8 70.4 44.9 85.9 62.0 58.2 1511/348 30.3
LLaVA-Llama-3-8B 68.9 36.8 69.8 60.9 73.3 47.3 87.2 63.5 58.0 1506/295 38.2
LLaVA-Llama-3-8B-v1.1 72.3 37.1 70.1 70.0 72.9 47.7 86.4 62.6 59.0 1469/349 45.1
LLaVA-Phi-3-mini 69.2 41.4 70.0 69.3 73.7 49.8 87.3 61.5 57.8 1477/313 43.7

Quickstart

Chat by pipeline

from transformers import pipeline
from PIL import Image    
import requests

model_id = "xtuner/llava-phi-3-mini-hf"
pipe = pipeline("image-to-text", model=model_id, device=0)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"

image = Image.open(requests.get(url, stream=True).raw)
prompt = "<|user|>\n<image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud<|end|>\n<|assistant|>\n"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
>>> [{'generated_text': '\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud (1) lava'}]

Chat by pure transformers

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-phi-3-mini-hf"

prompt = "<|user|>\n<image>\nWhat are these?<|end|>\n<|assistant|>\n"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

processor = AutoProcessor.from_pretrained(model_id)


raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
>>> What are these? These are two cats sleeping on a pink couch.

Reproduce

Please refer to docs.

Citation

@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}
Downloads last month
5,172
Safetensors
Model size
4.14B params
Tensor type
FP16
Β·
F32
Β·
Inference API
This model can be loaded on Inference API (serverless).

Dataset used to train xtuner/llava-phi-3-mini-hf

Spaces using xtuner/llava-phi-3-mini-hf 5

Collection including xtuner/llava-phi-3-mini-hf