metadata

license: llama2
pipeline_tag: image-text-to-text
language:
  - en

LLaVA-NeXT-Video Model Card

Below is the model card of LLaVa-NeXT-Video model 7b, which is copied from the original Llava model card that you can find here.

Check out also the Google Colab demo to run Llava on a free-tier Google Colab instance:

Or check out our Spaces demo!

Model details

Model type:
LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data.
Base LLM: lmsys/vicuna-7b-v1.5

Model date:
LLaVA-Next-Video-7B was trained in April 2024.

Paper or resources for more information:
https://github.com/LLaVA-VL/LLaVA-NeXT

How to use the model

First, make sure to have transformers >= 4.42.0. The model supports multi-visual and multi-prompt generation. Meaning that you can pass multiple images/videos in your prompt. Make sure also to follow the correct prompt template (USER: xxx\nASSISTANT:) and add the token <image> or <video> to the location where you want to query images/videos:

Below is an example script to run generation in float16 precision on a GPU device:

import requests
from PIL import Image
import av
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

processor = LlavaNextVideoProcessor.from_pretrained(model_id)

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

Inference with images as inputs

To generate from images use the below code after loading the model as shown above:

raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

Inference with images and videos as inputs

To generate from images and videos in one generate use the below code after loading the model as shown above:

prompts = [
  "USER: <image>\nWhat's the content of the image? ASSISTANT:",
  "USER: <video>\nWhy is this video funny? ASSISTANT:"
]
inputs = processor(text=prompts, images=image, videos=clip, padding=True, return_tensors="pt").to(model.device)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=100)
out = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(out)

Model optimization

4-bit quantization through `bitsandbytes` library

First make sure to install bitsandbytes, pip install bitsandbytes and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with:

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)

Use Flash-Attention 2 to further speed-up generation

First make sure to install flash-attn. Refer to the original repository of Flash Attention regarding that package installation. Simply change the snippet above with:

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)

License

Intended use

Primary intended uses:
The primary use of LLaVA is research on large multimodal models and chatbots.

Primary intended users:
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

Image

558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
500K academic-task-oriented VQA data mixture.
50K GPT-4V data mixture.
40K ShareGPT data.

Video

100K VideoChatGPT-Instruct.

Evaluation dataset

A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.