---
datasets:
- lmms-lab/LLaVA-NeXT-Video-SFT-Data
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
model-index:
- name: LLaVA-NeXT-Video-7B-Qwen2
  results:
  - task:
      type: multimodal
    dataset:
      name: ActNet-QA
      type: actnet-qa
    metrics:
    - type: accuracy
      value: 56.5
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: EgoSchema
      type: egoschema
    metrics:
    - type: accuracy
      value: 57.3
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MLVU
      type: mlvu
    metrics:
    - type: accuracy
      value: 70.8
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 58.6
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: NextQA
      type: nextqa
    metrics:
    - type: accuracy
      value: 83.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: PercepTest
      type: percepTest
    metrics:
    - type: accuracy
      value: 67.9
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoChatGPT
      type: videochatgpt
    metrics:
    - type: score
      value: 3.52
      name: score
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoDC
      type: videodc
    metrics:
    - type: score
      value: 3.66
      name: score
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LongVideoBench
      type: longvideobench
    metrics:
    - type: accuracy
      value: 58.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME
      type: videomme
    metrics:
    - type: accuracy
      value: 63.3
      name: accuracy
      verified: true
---

# LLaVA-NeXT-Video

## Table of Contents

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [Training](#training)
5. [License](#license)
6. [Citation](#citation)

## Model Summary

The LLaVA-NeXT-Video models are 7B/72B-parameter models trained on [LLaVA-NeXT-Video-SFT](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data), built on the Qwen2 language model with a context window of 32K tokens.

- **Repository:** [LLaVA-VL/LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file)
- **Point of Contact:** [Yuanhan Zhang](mailto:drluodian@gmail.com)
- **Languages:** English, Chinese

## Use

### Intended use

The model was trained on [LLaVA-NeXT-Video-SFT](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data) and can interact with images, multi-image inputs, and videos, with a particular focus on video (a single-image sketch is shown after the video example below).

**Feel free to share your generations in the Community tab!**

### Generation

We provide a simple generation example below. For more details, please refer to the [GitHub repository](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Sample frames from a video and return them with their timestamps."""
    if max_frames_num == 0:
        # Return a dummy frame (plus empty metadata) when no frames are requested.
        return np.zeros((1, 336, 336, 3)), "", 0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    # Sample roughly `fps` frames per second of video.
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    # If that yields too many frames (or force_sample is set), fall back to
    # uniformly sampling `max_frames_num` frames across the whole video.
    if len(frame_idx) > max_frames_num or force_sample:
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time


pretrained = "lmms-lab/LLaVA-NeXT-Video-7B-Qwen2"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()

video_path = "XXXX"
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()

conv_template = "qwen_1_5"  # Make sure you use the correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    # Pass the video as a single-element list with the matching modality tag.
    images=[video],
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
```
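
The `load_video` helper also returns `frame_time` and `video_time`, which the example above does not use. If you want the model to reason about when events happen, one option is to fold them into the prompt as a timing preamble. The sketch below continues from the example above; the wording of the preamble is our own illustration, not a prompt format prescribed by this card.

```python
# Hypothetical timing preamble built from load_video's return values; the exact
# wording is an assumption, not an official prompt format.
time_instruction = (
    f"The video lasts for {video_time:.2f} seconds, and {max_frames_num} frames "
    f"are uniformly sampled from it. These frames are located at {frame_time}. "
    f"Please answer the following questions related to this video."
)
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\nPlease describe this video in detail."

# Rebuild the conversation and rerun generation with the richer prompt.
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

cont = model.generate(
    input_ids,
    images=[video],
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
print(tokenizer.batch_decode(cont, skip_special_tokens=True))
```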
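
As noted under intended use, the model also handles single images. Below is a minimal single-image sketch that reuses the `tokenizer`, `model`, and `image_processor` loaded above. It assumes the image path mirrors the LLaVA-OneVision usage (`process_images` plus an `image_sizes` argument to `generate`); check the [GitHub repository](https://github.com/LLaVA-VL/LLaVA-NeXT) if the current API differs.

```python
# Minimal single-image sketch, reusing the imports and the model loaded above.
# The process_images + image_sizes pattern is assumed from LLaVA-OneVision usage.
image = Image.open("XXXX")  # path to your test image

image_tensor = process_images([image], image_processor, model.config)
# Match the dtype used for the video example above.
image_tensor = [_img.to(dtype=torch.bfloat16, device=device) for _img in image_tensor]

question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=[image.size],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
print(tokenizer.batch_decode(cont, skip_special_tokens=True))
```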

# Training

## Model

- **Architecture:** SO400M + Qwen2
- **Initialized Model:** lmms-lab/llava-onevision-qwen2-7b-si
- **Data:** A mixture of 1.6M single-image, multi-image, and video samples; 1 epoch; full-model training
- **Precision:** bfloat16 (see the loading sketch below)

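
Because training ran in bfloat16, it is natural to load the checkpoint in the same precision for inference. The sketch below assumes `load_pretrained_model` forwards a `torch_dtype` keyword to the underlying `from_pretrained` call (the `llava_model_args` comment in the generation example suggests extra model arguments are accepted); drop the keyword if your version of the builder rejects it.

```python
from llava.model.builder import load_pretrained_model

# Assumed keyword: torch_dtype is forwarded to from_pretrained by the builder.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    "lmms-lab/LLaVA-NeXT-Video-7B-Qwen2",
    None,
    "llava_qwen",
    torch_dtype="bfloat16",
    device_map="auto",
)
model.eval()
```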

## Hardware & Software

- **GPUs:** 256 * Nvidia Tesla A100 (for training the whole model series)
- **Orchestration:** [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)

# Citation