weizhiwang committed
Commit e4f28c4 (1 parent: 88dd034)

Update README.md

Files changed (1): README.md (+9, -10)
README.md CHANGED
@@ -7,7 +7,7 @@ language:
  - en
  ---
  
- # Model Card for LLaVA-Video-LLaMA-3-8B
+ # Model Card for LLaVA-Video-LLaMA-3
  
  <!-- Provide a quick summary of what the model is/does. -->
  
@@ -18,7 +18,9 @@ Please follow my github repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/L
  - [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). Now it supports the latest llama-3, phi-3, mistral-v0.1-7b models.
  
  ## Model Details
- Follows LLavA-1.5 pre-train and supervised fine-tuning pipeline. You do not need to change the LLaVA codebase to accommodate Llama-3.
+ - Video Frame Sampling: Considering we adopt CLIP-ViT-L-336px as the image encoder (576 tokens for one image) and the context window of LLaMA-3 is 8k, the video frame sampling rate is set as max(30, num_frames//15).
+ - Template: We follow the LLaVA-v1 template for constructing the conversation.
+ - Architecture: LLaVA architecture, visual encoder + MLP adapter + LLM backbone
  
  ## How to Use
  
@@ -39,11 +41,11 @@ from io import BytesIO
  
  # load model and processor
  device = "cuda" if torch.cuda.is_available() else "cpu"
- model_name = get_model_name_from_path("weizhiwang/llava_llama3_8b_video")
+ model_name = get_model_name_from_path("weizhiwang/LLaVA-Video-Llama-3")
  tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/llava_llama3_8b_video", None, model_name, False, False, device=device)
  
  # prepare inputs for the model
- text = '<image>' + '\n' + "Describe the image."
+ text = '<image>' + '\n' + "Describe the video."
  conv = conv_templates["llama_3"].copy()
  conv.append_message(conv.roles[0], text)
  conv.append_message(conv.roles[1], None)
@@ -51,7 +53,7 @@ prompt = conv.get_prompt()
  input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).cuda()
  
  # prepare image input
- url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
+ url = "https://github.com/PKU-YuanGroup/Video-LLaVA/blob/main/videollava/serve/examples/sample_demo_1.mp4"
  response = requests.get(url)
  image = Image.open(BytesIO(response.content)).convert('RGB')
  image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().cuda()
@@ -70,9 +72,6 @@ print(outputs[0])
  ```
  The image caption results look like:
  ```
- The image features a blue and orange double-decker bus parked on a street. The bus is stopped at a bus stop, waiting for passengers to board. There are several people standing around the bus, some of them closer to the bus and others further away.
- 
- In the background, there are two cars parked on the street, one on the left side and the other on the right side. Additionally, there is a traffic light visible in the scene, indicating that the bus is stopped at an intersection.
  ```
  
  # Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
@@ -82,8 +81,8 @@ Please refer to a forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA
  ## Citation
  
  ```bibtex
- @misc{wang2024llavallama3,
-       title={LLaVA-Llama-3-8B: A reproduction towards LLaVA-v1.5 based on Llama-3-8B LLM backbone},
+ @misc{wang2024llavavideollama3,
+       title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
        author={Wang, Weizhi},
        year={2024}
  }
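
The frame-sampling rule added in the Model Details hunk, `max(30, num_frames//15)`, is easier to see as code. The sketch below illustrates one plausible reading of that expression, treating it as a sampling stride so that a long video contributes at most roughly 15 frames of 576-token CLIP features, staying near the 8k LLaMA-3 context limit; the helper name and the stride interpretation are assumptions, not code from the repository.

```python
# Illustrative sketch (assumption, not repository code): interpret
# max(30, num_frames // 15) as the stride between sampled video frames.

def sample_frame_indices(num_frames: int) -> list[int]:
    """Indices of the frames kept for a video with `num_frames` total frames."""
    stride = max(30, num_frames // 15)  # sampling rate quoted in the model card
    return list(range(0, num_frames, stride))

# Example: a 60 s clip at 30 fps has 1800 frames -> stride 120 -> 15 frames kept.
print(len(sample_frame_indices(1800)))  # 15
```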
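
The updated usage hunk points `url` at an .mp4 clip, but the changed lines still decode the download with `PIL.Image`, which only handles still images; the full README presumably performs frame extraction outside the lines shown in this diff. Purely as a hedged sketch, not the repository's code, the snippet below shows one way the clip could be turned into a batch of frames for the CLIP processor. It assumes OpenCV is available, reuses `image_processor` from the README example and `sample_frame_indices` from the previous sketch, and assumes the raw-file counterpart of the GitHub blob URL (a blob URL returns an HTML page rather than the video bytes).

```python
# Hedged sketch only -- not the repository's code. Assumes: OpenCV (cv2) installed,
# `image_processor` from the README snippet above, `sample_frame_indices` from the
# previous sketch, and the raw.githubusercontent.com counterpart of the blob URL.
import tempfile

import cv2
import requests
from PIL import Image

raw_url = "https://raw.githubusercontent.com/PKU-YuanGroup/Video-LLaVA/main/videollava/serve/examples/sample_demo_1.mp4"  # assumed raw URL

with tempfile.NamedTemporaryFile(suffix=".mp4") as f:
    f.write(requests.get(raw_url).content)
    f.flush()
    cap = cv2.VideoCapture(f.name)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    keep = set(sample_frame_indices(total))
    frames = []
    for idx in range(total):
        ok, frame = cap.read()
        if not ok:
            break
        if idx in keep:
            # OpenCV decodes to BGR; the CLIP processor expects RGB PIL images
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()

# Shape (num_kept_frames, 3, 336, 336), mirroring the per-image call in the README snippet
video_tensor = image_processor.preprocess(frames, return_tensors='pt')['pixel_values'].half().cuda()
```

How the stacked frame tensor is then passed to `model.generate` depends on the LLaVA-Video-Llama-3 code itself, which is not shown in this diff.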