This is a Hugging Face-compatible conversion of the model. The original can be found at https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-preview

LLaVA 13B Model Card

Model details

Model type: LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

Model date: LLaVA-LLaMA-2-13B-Chat-Preview was trained in July 2023.

Paper or resources for more information: https://llava-vl.github.io/

License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Where to send questions or comments about the model: https://github.com/haotian-liu/LLaVA/issues

Intended use

Primary intended uses: The primary use of LLaVA is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 80K GPT-generated multimodal instruction-following examples.

Evaluation dataset

A preliminary evaluation of model quality uses a set of 90 visual reasoning questions built from 30 unique images randomly sampled from COCO val 2014. Each image is associated with three types of questions: conversational, detailed description, and complex reasoning. We use GPT-4 to judge the model outputs. We also evaluate our model on the ScienceQA dataset, where the model in combination with GPT-4 sets a new state of the art. See https://llava-vl.github.io/ for more details.
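
As a rough illustration of how the 90-question set breaks down, the sketch below pairs 30 placeholder image IDs with the three question types. The helper `build_eval_grid` and the image IDs are hypothetical and are not part of the LLaVA codebase; the real evaluation samples its images from COCO val 2014.

```python
from itertools import product

QUESTION_TYPES = ("conversational", "detailed description", "complex reasoning")

def build_eval_grid(image_ids, question_types=QUESTION_TYPES):
    """Pair every image with every question type (hypothetical helper)."""
    return [
        {"image_id": image_id, "type": question_type}
        for image_id, question_type in product(image_ids, question_types)
    ]

# Placeholder IDs standing in for the 30 sampled COCO val 2014 images.
image_ids = [f"image_{n:02d}" for n in range(30)]
grid = build_eval_grid(image_ids)

# 30 images x 3 question types
print(len(grid))
```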

Usage

The model can be loaded and queried as follows:

from transformers import LlavaProcessor, LlavaForCausalLM
from PIL import Image
import requests
import torch

PATH_TO_CONVERTED_WEIGHTS = "shauray/Llava-Llama-2-13B-hf"

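# Load the model in half precision directly onto the GPU.
# device_map="cuda" already places the weights, so no extra .to("cuda") call is needed.
model = LlavaForCausalLM.from_pretrained(
    PATH_TO_CONVERTED_WEIGHTS,
    device_map="cuda",
    torch_dtype=torch.float16,
)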
processor = LlavaProcessor.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)

url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "How can you best describe this image?"

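# Move the inputs to the GPU in half precision to match the model.
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)

# Generate with low-temperature nucleus sampling.
generate_ids = model.generate(
    **inputs,
    do_sample=True,
    max_length=1024,
    temperature=0.1,
    top_p=0.9,
)
# Skip the prompt tokens and decode only the newly generated text.
out = processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()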

print(out)

# Example output (sampling is enabled, so the text will vary between runs):
"""The photograph shows a wooden dock floating on the water, with mountains in the background. It is an idyllic scene that captures both
nature and human-made structures at their finest moments of beauty or tranquility depending upon one's perspective as they gaze into it"""
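
The snippet above loads the weights in float16, so it can help to estimate how much GPU memory the weights alone will need. The sketch below is a back-of-the-envelope calculation that assumes roughly 13 billion parameters (taken from the "13B" in the model name); it ignores the vision encoder, activations, and the generation cache, so the real footprint will be higher. The helper `weight_memory_gib` is hypothetical.

```python
# Assumption: about 13 billion parameters, as the "13B" in the model name suggests.
PARAMS = 13_000_000_000
BYTES_PER_PARAM_FP16 = 2  # torch.float16 stores each value in 2 bytes

def weight_memory_gib(n_params, bytes_per_param):
    """Return the memory needed for the weights alone, in GiB (hypothetical helper)."""
    return n_params * bytes_per_param / 1024**3

fp16_gib = weight_memory_gib(PARAMS, BYTES_PER_PARAM_FP16)
print(f"float16 weights: about {fp16_gib:.1f} GiB")
```

Under these assumptions the weights come to a little over 24 GiB, so a single GPU with less memory than that would likely need a different loading strategy.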