---
license: llama2
---

# Model Details

Note: Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, please visit the [website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and accept the Llama 2 Community License Agreement before requesting access here.

## Model type:

LLaVA vision-language model trained on instruction-following data generated by open-source (OSS) LLMs.

## Model state:

FireLLaVA 13B was trained in December 2023.

## Paper or resources for more information:

https://llava-vl.github.io/

# How to use the model

The model is served on Fireworks.ai, and you can try it out here: https://app.fireworks.ai/models/fireworks/firellava-13b

API endpoints are also available with instructions linked here: https://readme.fireworks.ai/docs/querying-vision-language-models
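
For programmatic access, here is a minimal request sketch. It assumes the OpenAI-compatible chat completions endpoint and the `accounts/fireworks/models/firellava-13b` model name, with the image passed as an `image_url` content part; please check the documentation linked above for the exact request schema and authentication details.

```python
# Minimal sketch, assuming the OpenAI-compatible chat completions endpoint and
# image_url content parts described in the docs linked above; verify the exact
# schema and model name there before relying on this.
import os
import requests

response = requests.post(
    "https://api.fireworks.ai/inference/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",  # your Fireworks API key
        "Content-Type": "application/json",
    },
    json={
        "model": "accounts/fireworks/models/firellava-13b",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is the make of the car?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 200,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```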

Otherwise, if you wish to run the model locally with the Hugging Face `transformers` library, follow the instructions below.

First, make sure you have `transformers >= 4.35.3`. The model supports multi-image and multi-prompt generation, meaning you can pass multiple images in your prompt. Make sure also to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and add the token `<image>` at each location where you want to query an image.

However, do note that performance may degrade when multiple images are passed in the input, since the model was not trained with multiple images in the input.
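
As a quick illustration of the template (plain strings only, not from the original instructions), a single-image prompt and a two-image prompt look like this; each `<image>` token is consumed, in order, by one image passed alongside the text:

```python
# Illustrative prompt strings only; runnable examples follow below.
# Each <image> token corresponds, in order, to one image passed with the prompt.
single_image_prompt = "USER: <image>\nDescribe this image.\n\nASSISTANT:"
two_image_prompt = "USER: <image>\n<image>\nWhat differs between these two images?\n\nASSISTANT:"
```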

## Using `pipeline`

```python
from transformers import pipeline
from PIL import Image
import requests

model_id = "fireworks-ai/FireLLaVA-13b"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"

image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
>>> [{'generated_text': 'USER: \nWhat is the make of the car? Answer with one word or phrase.\n\nASSISTANT: Volkswagen'}]
```

## Using pure `transformers`

```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "fireworks-ai/FireLLaVA-13b"

prompt = "USER: <image>\nWhat is this?\n\nASSISTANT:"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
>>> "This is an early Volkswagen Beetle car, also known as a VW bug, parked on a brick street and next to a building with doors ..."
```
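
For the multi-image case mentioned above, here is a hedged sketch that reuses the `model`, `processor`, `raw_image`, `requests`, and `Image` objects from the previous block; the second image URL is only an illustrative example, and output quality may degrade as noted earlier.

```python
# Continuation of the example above (model, processor, and raw_image already defined).
# Multi-image prompting: one <image> token per image, images passed in the same order.
multi_prompt = "USER: <image>\n<image>\nWhat is shown in each of these two images?\n\nASSISTANT:"

url_2 = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example second image
image_2 = Image.open(requests.get(url_2, stream=True).raw)

multi_inputs = processor(text=multi_prompt, images=[raw_image, image_2], return_tensors="pt").to(0, torch.float16)
multi_output = model.generate(**multi_inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(multi_output[0], skip_special_tokens=True))
```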