---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: video-text-to-text
---

# Model Card for LLaVA-Video-LLaMA-3

<!-- Provide a quick summary of what the model is/does. -->

Please follow my GitHub repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3/) for more details on fine-tuning the LLaVA model with Llama-3 as the foundation LLM.

## Updates 
- [6/4/2024] The codebase now supports fine-tuning on video data for video understanding tasks.
- [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). It now supports the latest Llama-3, Phi-3, and Mistral-v0.1-7B models.

## Model Details
- Video Frame Sampling: Since we adopt CLIP-ViT-L-336px as the image encoder (576 tokens per image) and the context window of Llama-3 is 8k, the frame sampling interval is set to max(30, num_frames // 10), so at most roughly 10 frames are fed to the model (see the sketch after this list).
- Template: We follow the LLaVA-v1 template for constructing the conversation.
- Architecture: LLaVA architecture, visual encoder + MLP adapter + LLM backbone
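
As a rough illustration of the sampling rule above (my own sketch, not code from the repo), the helper below picks frame indices with an interval of max(30, num_frames // 10):
```python
# Hypothetical helper illustrating the frame-sampling rule described above;
# the repo's actual implementation may differ.
def sample_frame_indices(num_frames):
    # The interval grows with video length, capping the sample at roughly 10 frames,
    # so the 576-token-per-frame visual input fits within Llama-3's 8k context.
    interval = max(30, num_frames // 10)
    return list(range(0, num_frames, interval))

# A 900-frame clip (~30 s at 30 fps) yields 10 frames: [0, 90, 180, ..., 810]
print(sample_frame_indices(900))
```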

## How to Use

Please first install llava via:
```
pip install git+https://github.com/Victorwz/LLaVA-Video-Llama-3.git
```
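
As an optional sanity check, you can verify the installation by importing the same modules used in the inference example below:
```python
# Optional sanity check: import the llava modules used in the inference example below.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

print("llava is installed:", callable(load_pretrained_model))
```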

You can load the model and perform inference as follows:
```python
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
from PIL import Image
import requests
import cv2
import torch
import base64
import io

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/LLaVA-Video-Llama-3")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Video-Llama-3", None, model_name, False, False, device=device)

# prepare video input
url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"

def read_video(video_url):
    # download the video to a local temporary file
    response = requests.get(video_url)
    if response.status_code != 200:
        raise RuntimeError("Failed to download video")
    with open("tmp_video.mp4", 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)
    
    video = cv2.VideoCapture("tmp_video.mp4")

    base64Frames = []
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

    video.release()
    print(len(base64Frames), "frames read.")
    return base64Frames

video_frames = read_video(video_url=url)
image_tensors = []
# sample roughly 10 frames, evenly spaced across the video
sampling_interval = max(1, len(video_frames) // 10)
for i in range(0, len(video_frames), sampling_interval):
    rawbytes = base64.b64decode(video_frames[i])
    image = Image.open(io.BytesIO(rawbytes)).convert("RGB")
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0].half().to(device)
    image_tensors.append(image_tensor)

# prepare inputs for the model: one <image> placeholder per sampled frame, followed by the question
text = "\n".join(['<image>' for _ in range(len(image_tensors))]) + '\n' + "Why is this video funny?"
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# -200 is the image token index LLaVA uses to mark <image> positions in the prompt
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(outputs[0])
```
The generated response looks like:
```
The video is funny because it shows a baby girl wearing glasses and reading a book, which is an unusual and amusing sight. It is not common to see a baby wearing glasses and engaging in a reading activity, as they are still developing their motor skills and cognitive abilities. The image captures a cute and endearing moment, as the baby appears to be enjoying her time and learning to read. This scene can evoke a sense of warmth and delight in the viewer, as it showcases the innocence and curiosity of childhood.
```

## Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
Please refer to the forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) repo for fine-tuning data preparation and scripts. The data loading function and the FastChat conversation template have been modified to accommodate the Llama-3 tokenizer.
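
For a sense of what the modified conversation template produces, here is a rough sketch of the Llama-3 chat layout (an illustration under my own assumptions; the `llama_3` template in the repo's `llava.conversation` module is the authoritative implementation):
```python
# Rough sketch of the Llama-3 chat layout (for illustration only);
# use conv_templates["llama_3"] from the repo for the actual prompt construction.
def build_llama3_prompt(system, user):
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(build_llama3_prompt("You are a helpful assistant.", "<image>\nWhy is this video funny?"))
```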


## Citation

```bibtex
@misc{wang2024llavavideollama3,
  title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}
```