FireLLaVA-13b / README.md
websterbei's picture
Update README.md
2a8a538 verified
metadata
license: llama2

Model Details

Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the website and accept the Llama 2 Community License Agreement before requesting access here.

Model type:

LLaVA vision-language model trained on OSS LLM generated instruction following data.

Model state:

FireLLaVA 13B was trained in December 2023

Paper or resources for more information:

https://llava-vl.github.io/

How to use the model

The model is served on Fireworks.ai, and you can try it out here: https://app.fireworks.ai/models/fireworks/firellava-13b API endpoints are also available with instructions linked here: https://readme.fireworks.ai/docs/querying-vision-language-models

Otherwise, if you wish to run the model locally using huggingface transformers library, you can do so, please read the instructions below. First, make sure to have transformers >= 4.35.3. The model supports multi-image and multi-prompt generation. Meaning that you can pass multiple images in your prompt. Make sure also to follow the correct prompt template (USER: xxx\nASSISTANT:) and add the token <image> to the location where you want to query images. However, do note that model performance with multiple images in the input may degrade since it is not trained with multiple images in the input.

Using pipeline

from transformers import pipeline
from PIL import Image    
import requests

model_id = "fireworks-ai/FireLLaVA-13b"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"

image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
>>> [{'generated_text': 'USER:  \nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT: Volkswagen'}]

Using pure transformers

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "fireworks-ai/FireLLaVA-13b"

prompt = "USER: <image>\nWhat is this?\n\nASSISTANT:"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
>>> "This is an early Volkswagen Beetle car, also known as a VW bug, parked on a brick street and next to a building with doors ..."