RaushanTurganbay (HF staff) committed on
Commit 9620ec6
1 parent: 5c77409

Update README.md

Files changed (1):
  1. README.md (+178, -1)
README.md CHANGED

---
license: llama2
pipeline_tag: image-text-to-text
language:
- en
---

# LLaVA-NeXT-Video Model Card

Below is the model card of the LLaVA-NeXT-Video 7B model, which is copied from the original LLaVA model card that you can find [here](https://huggingface.co/liuhaotian/llava-v1.5-13b).

Check out the Google Colab demo to run LLaVA on a free-tier Google Colab instance: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing)

Or check out our Spaces demo! [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/llava-hf/llava-4bit)

## Model details

**Model type:**
<br>
LLaVA-NeXT-Video is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data.
<br>
Base LLM: lmsys/vicuna-7b-v1.5

**Model date:**
<br>
LLaVA-NeXT-Video-7B was trained in April 2024.

**Paper or resources for more information:**
<br>
https://github.com/LLaVA-VL/LLaVA-NeXT

## How to use the model

First, make sure you have `transformers >= 4.42.0` installed (for example via `pip install -U transformers`).
The model supports multi-visual and multi-prompt generation, meaning that you can pass multiple images and/or videos in your prompt. Also make sure to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and to add the token `<image>` or `<video>` at the location where you want to query images or videos.

Below is an example script to run generation in `float16` precision on a GPU device:

```python
import av
import numpy as np
import requests
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = LlavaNextVideoProcessor.from_pretrained(model_id)

def read_video_pyav(container, indices):
    '''
    Decode the video with the PyAV decoder.

    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.

    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample 8 frames uniformly from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```

### Inference with images as inputs

To generate from images, use the code below after loading the model as shown above:

```python
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(text=prompt, images=raw_image, return_tensors="pt").to(0, torch.float16)

output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```

### Inference with images and videos as inputs

To generate from images and videos in a single `generate` call, use the code below after loading the model, the image, and the video clip as shown above:

```python
prompts = [
    "USER: <image>\nWhat's the content of the image? ASSISTANT:",
    "USER: <video>\nWhy is this video funny? ASSISTANT:"
]
# `raw_image` and `clip` come from the image and video examples above
inputs = processor(text=prompts, images=raw_image, videos=clip, padding=True, return_tensors="pt").to(model.device)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=100)
out = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(out)
```

### Model optimization

#### 4-bit quantization through the `bitsandbytes` library

First make sure to install `bitsandbytes` (`pip install bitsandbytes`) and to have access to a CUDA-compatible GPU device. Then simply change the snippet above as follows:

```diff
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
```
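
If you prefer an explicit configuration object, the same 4-bit setup can be written with `BitsAndBytesConfig`. This is a sketch rather than part of the original card; it reuses `model_id` and the `torch` import from the script above:

```python
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration

# Explicit 4-bit quantization config (equivalent to passing load_in_4bit=True)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # run compute inside the 4-bit layers in fp16
)

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
)
```

Note that a `bitsandbytes`-quantized model should not be moved with `.to()`; device placement is handled during loading.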

#### Use Flash-Attention 2 to further speed up generation

First make sure to install `flash-attn`. Refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) for instructions on installing the package. Then simply change the snippet above as follows:

```diff
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)
```
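
On recent `transformers` versions, the same effect can also be requested through the `attn_implementation` argument; this is a sketch (not from the original card) that assumes `flash-attn` is installed and the GPU supports it:

```python
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",  # requires flash-attn and a compatible GPU
).to(0)
```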

## License
Llama 2 is licensed under the LLAMA 2 Community License,
Copyright (c) Meta Platforms, Inc. All Rights Reserved.

## Intended use
**Primary intended uses:**
<br>
The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:**
<br>
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Training dataset

### Image
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 500K academic-task-oriented VQA data mixture.
- 50K GPT-4V data mixture.
- 40K ShareGPT data.

### Video
- 100K VideoChatGPT-Instruct.

## Evaluation dataset
A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.