---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
---

# Model Card for LaViA-Llama-3-8b


Please refer to my GitHub repo [LaViA](https://github.com/Victorwz/LaViA) for more details on fine-tuning the LaViA model with Llama-3 as the foundation LLM.

## Model Details
- Video Frame Sampling: Since we adopt CLIP-ViT-L-336px as the image encoder (576 tokens per frame) and LLaMA-3 has an 8k-token context window, the frame sampling interval is set to max(30, num_frames // 10); see the sketch after this list.
- Template: We follow the LLaVA-v1 template for constructing conversations.
- Architecture: LLaVA architecture (visual encoder + MLP adapter + LLM backbone).
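
A minimal sketch of how this sampling interval keeps the visual tokens within the context budget (the constants and function below are illustrative, not part of the LaViA codebase):

```python
# Illustrative only: sampling every max(30, num_frames // 10) frames keeps the
# visual tokens (576 per frame with CLIP-ViT-L-336px) well inside an 8k context.
TOKENS_PER_FRAME = 576
CONTEXT_WINDOW = 8192

def sampled_frame_count(num_frames: int) -> int:
    """Number of frames kept when sampling every max(30, num_frames // 10) frames."""
    interval = max(30, num_frames // 10)
    return len(range(0, num_frames, interval))

for num_frames in (90, 300, 1800, 7200):
    kept = sampled_frame_count(num_frames)
    visual_tokens = kept * TOKENS_PER_FRAME
    print(f"{num_frames:5d} frames -> keep {kept:2d}, {visual_tokens} visual tokens "
          f"(fits in 8k context: {visual_tokens < CONTEXT_WINDOW})")
```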

## How to Use

Please first install LaViA via:
```bash
git clone https://github.com/Victorwz/LaViA
cd LaViA-video-sft
pip install -e ./
```

You can load the model and perform inference as follows:
```python
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from PIL import Image
import requests
import cv2
import torch
import base64
import io

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/LaViA-Llama-38b")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LaViA-Llama-38b", None, model_name, False, False, device=device)

# prepare video input: download the demo video and extract frames
url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"

def read_video(video_url):
    # download the video to a temporary local file
    response = requests.get(video_url)
    if response.status_code != 200:
        raise RuntimeError("Failed to download video")
    with open("tmp_video.mp4", 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)

    # decode frames with OpenCV and keep them as base64-encoded JPEGs
    video = cv2.VideoCapture("tmp_video.mp4")

    base64Frames = []
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

    video.release()
    print(len(base64Frames), "frames read.")
    return base64Frames

# sample roughly 10 evenly spaced frames and preprocess them with the CLIP image processor
video_frames = read_video(video_url=url)
image_tensors = []
sampling_interval = max(1, len(video_frames) // 10)
for i in range(0, len(video_frames), sampling_interval):
    rawbytes = base64.b64decode(video_frames[i])
    image = Image.open(io.BytesIO(rawbytes)).convert("RGB")
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0].half().to(device)
    image_tensors.append(image_tensor)

# prepare inputs for the model
# one <image> placeholder per sampled frame, followed by the user question
text = "\n".join(['<image>' for _ in range(len(image_tensors))]) + '\n' + "Why is this video funny?"
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)  # -200 is the image-token placeholder index

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(outputs[0])
```
The generated video description looks like:
```
The video is funny because it shows a baby girl wearing glasses and reading a book, which is an unusual and amusing sight. It is not common to see a baby wearing glasses and engaging in a reading activity, as they are still developing their motor skills and cognitive abilities. The image captures a cute and endearing moment, as the baby appears to be enjoying her time and learning to read. This scene can evoke a sense of warmth and delight in the viewer, as it showcases the innocence and curiosity of childhood.
```

## Citation

```bibtex
@misc{wang2024LaViA,
      title={LaViA: Fine-Tuning Multimodal LLMs as Task Assistants with Video Instructions}, 
      url={https://github.com/Victorwz/LaViA},
      author={Wang, Weizhi and Luo, Xuan and Yan, Xifeng},
      year={2024},
}
```