---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
---

# Model Card for LLaVA-Video-LLaMA-3-8B

<!-- Provide a quick summary of what the model is/does. -->

A reproduced LLaVA LVLM built on the Llama-3-8B LLM backbone. This is not an official implementation.
Please follow my reproduced implementation [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3/) for more details on fine-tuning a LLaVA model with Llama-3 as the foundation LLM.

## Updates
- [6/4/2024] The codebase supports fine-tuning on video data for video understanding tasks.
- [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). It now supports the latest llama-3, phi-3, and mistral-v0.1-7b models.

## Model Details
The model follows the LLaVA-1.5 pre-training and supervised fine-tuning pipeline. You do not need to change the LLaVA codebase to accommodate Llama-3.

## How to Use

First, install llava via
```
pip install git+https://github.com/Victorwz/LLaVA-Video-Llama-3.git
```

You can load the model and perform inference as follows:
```python
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
from PIL import Image
import requests
import torch
from io import BytesIO

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/llava_llama3_8b_video")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/llava_llama3_8b_video", None, model_name, False, False, device=device)

# prepare inputs for the model
text = '<image>' + '\n' + "Describe the image."
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# -200 is the image-token placeholder index used by the LLaVA codebase
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).cuda()

# prepare image input
url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert('RGB')
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().cuda()

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

# decode only the newly generated tokens
outputs = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)
print(outputs[0])
```
The generated image caption looks like:
```
The image features a blue and orange double-decker bus parked on a street. The bus is stopped at a bus stop, waiting for passengers to board. There are several people standing around the bus, some of them closer to the bus and others further away.

In the background, there are two cars parked on the street, one on the left side and the other on the right side. Additionally, there is a traffic light visible in the scene, indicating that the bus is stopped at an intersection.
```

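Since this checkpoint targets video understanding, you may also want to run inference on video inputs. The snippet below is only a minimal sketch, not the official interface: it assumes frames sampled with OpenCV can be preprocessed by the same `image_processor` and stacked into one tensor, and the file name `example.mp4` and the choice of 8 frames are placeholders. Please check the [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3/) repo for the exact video input format expected by `model.generate`.

```python
# Hypothetical sketch: uniformly sample frames from a local video and preprocess
# them with the same image_processor used for the single-image example above.
import cv2
import numpy as np
from PIL import Image

def sample_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes to BGR; convert to RGB before handing frames to the processor
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("example.mp4", num_frames=8)  # placeholder path
# shape [num_frames, 3, H, W]; how the stacked frames are passed to generate()
# may differ in the fork, so treat this only as the preprocessing step
frame_tensor = image_processor.preprocess(frames, return_tensors='pt')['pixel_values'].half().cuda()
```
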
## Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
Please refer to the forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) repo for fine-tuning data preparation and scripts. The data loading function and the fastchat conversation template were changed to accommodate the different tokenizer.

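The repo's data preparation instructions define the exact schema. As a rough illustration only, LLaVA-style instruction data is typically a JSON list of records, each with a media path and a multi-turn `conversations` field. The field names (`video`, `conversations`, `from`, `value`), the placeholder token, and the paths below are assumptions for illustration, not the fork's confirmed format.

```json
[
  {
    "id": "sample-0001",
    "video": "videos/sample-0001.mp4",
    "conversations": [
      {"from": "human", "value": "<image>\nWhat happens in this video?"},
      {"from": "gpt", "value": "A person assembles a wooden chair on a patio."}
    ]
  }
]
```
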
## Citation

```bibtex
@misc{wang2024llavallama3,
  title={LLaVA-Llama-3-8B: A reproduction towards LLaVA-v1.5 based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}
```