weizhiwang committed: Update README.md (commit e4f28c4, parent 88dd034)

README.md (updated content):
language:
- en
---

# Model Card for LLaVA-Video-LLaMA-3

<!-- Provide a quick summary of what the model is/does. -->

Please follow my github repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3).

- [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). It now supports the latest llama-3, phi-3, and mistral-v0.1-7b models.

## Model Details

- Video Frame Sampling: Because we adopt CLIP-ViT-L-336px as the image encoder (576 tokens per image) and the context window of LLaMA-3 is 8k, the video frame sampling rate is set to max(30, num_frames//15); see the sketch after this list.
- Template: We follow the LLaVA-v1 template for constructing the conversation.
- Architecture: LLaVA architecture (visual encoder + MLP adapter + LLM backbone).
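The sampling rule above is easiest to read as the stride between kept frames; below is a minimal sketch of that interpretation (the helper name is illustrative, not part of the repository):

```python
def sampled_frame_indices(num_frames: int) -> list[int]:
    # Keep every k-th frame with k = max(30, num_frames // 15): short clips are
    # sampled roughly once every 30 frames, while long videos contribute at most
    # about 15 frames, which keeps the visual token count (576 tokens per frame)
    # manageable within LLaMA-3's 8k context window.
    stride = max(30, num_frames // 15)
    return list(range(0, num_frames, stride))


# Example: a 60-second clip at 30 fps (1800 frames) keeps 15 frames:
# sampled_frame_indices(1800) == [0, 120, 240, ..., 1680]
```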
## How to Use

```python
# Imports assume the upstream LLaVA (llava-next) package layout.
import requests
import torch
from PIL import Image
from io import BytesIO

from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, tokenizer_image_token

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/LLaVA-Video-Llama-3")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/llava_llama3_8b_video", None, model_name, False, False, device=device)

# prepare inputs for the model
text = '<image>' + '\n' + "Describe the video."
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# -200 is the placeholder id that marks where the image features are inserted
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).cuda()

# prepare image input
url = "https://github.com/PKU-YuanGroup/Video-LLaVA/blob/main/videollava/serve/examples/sample_demo_1.mp4"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert('RGB')
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().cuda()

# generate and decode the output (standard LLaVA-style call; exact arguments may differ in the repo)
with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, do_sample=False, max_new_tokens=512, use_cache=True)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(outputs[0])
```
The image caption results look like:
```
The image features a blue and orange double-decker bus parked on a street. The bus is stopped at a bus stop, waiting for passengers to board. There are several people standing around the bus, some of them closer to the bus and others further away.

In the background, there are two cars parked on the street, one on the left side and the other on the right side. Additionally, there is a traffic light visible in the scene, indicating that the bus is stopped at an intersection.
```

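The snippet above reads the downloaded bytes as a single still image. To caption an actual video file you would first decode and subsample its frames; the following is a minimal, hypothetical sketch (it assumes OpenCV is installed and a local file path, and reuses the sampling rule from Model Details; it is not the repository's own loader):

```python
import cv2


def load_video_tensor(path, image_processor):
    # Decode the video and keep one frame every max(30, num_frames // 15) frames,
    # following the sampling rule described under Model Details.
    cap = cv2.VideoCapture(path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    stride = max(30, num_frames // 15)
    frames = []
    for idx in range(0, num_frames, stride):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; convert to RGB before CLIP preprocessing.
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    pixel_values = image_processor.preprocess(frames, return_tensors='pt')['pixel_values']
    return pixel_values.half().cuda()


# image_tensor = load_video_tensor("sample_demo_1.mp4", image_processor)  # hypothetical local path
```

The resulting frame tensor would replace `image_tensor` in the snippet above; how multiple frames are batched into the prompt follows the repository code, so treat this only as a sketch.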
# Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data

Please refer to a forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) codebase for fine-tuning on your own video instruction data.

## Citation

```bibtex
@misc{wang2024llavavideollama3,
  title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}
```