weizhiwang committed
Commit e4f28c4 (1 parent: 88dd034)

Update README.md

Files changed (1): README.md (+9, -10)
README.md CHANGED
@@ -7,7 +7,7 @@ language:
  - en
  ---
  
- # Model Card for LLaVA-Video-LLaMA-3-8B
+ # Model Card for LLaVA-Video-LLaMA-3
  
  <!-- Provide a quick summary of what the model is/does. -->
  
@@ -18,7 +18,9 @@ Please follow my github repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/L
  - [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). Now it supports the latest llama-3, phi-3, mistral-v0.1-7b models.
  
  ## Model Details
- Follows LLavA-1.5 pre-train and supervised fine-tuning pipeline. You do not need to change the LLaVA codebase to accommodate Llama-3.
+ - Video Frame Sampling: Considering we adopt CLIP-ViT-L-336px as the image encoder (576 tokens for one image) and the context window of LLaMA-3 is 8k, the video frame sampling rate is set as max(30, num_frames//15).
+ - Template: We follow the LLaVA-v1 template for constructing the conversation.
+ - Architecture: LLaVA architecture, visual encoder + MLP adapter + LLM backbone
  
  ## How to Use
  
@@ -39,11 +41,11 @@ from io import BytesIO
  
  # load model and processor
  device = "cuda" if torch.cuda.is_available() else "cpu"
- model_name = get_model_name_from_path("weizhiwang/llava_llama3_8b_video")
+ model_name = get_model_name_from_path("weizhiwang/LLaVA-Video-Llama-3")
  tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/llava_llama3_8b_video", None, model_name, False, False, device=device)
  
  # prepare inputs for the model
- text = '<image>' + '\n' + "Describe the image."
+ text = '<image>' + '\n' + "Describe the video."
  conv = conv_templates["llama_3"].copy()
  conv.append_message(conv.roles[0], text)
  conv.append_message(conv.roles[1], None)
@@ -51,7 +53,7 @@ prompt = conv.get_prompt()
  input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).cuda()
  
  # prepare image input
- url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
+ url = "https://github.com/PKU-YuanGroup/Video-LLaVA/blob/main/videollava/serve/examples/sample_demo_1.mp4"
  response = requests.get(url)
  image = Image.open(BytesIO(response.content)).convert('RGB')
  image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().cuda()
@@ -70,9 +72,6 @@ print(outputs[0])
  ```
  The image caption results look like:
  ```
- The image features a blue and orange double-decker bus parked on a street. The bus is stopped at a bus stop, waiting for passengers to board. There are several people standing around the bus, some of them closer to the bus and others further away.
- 
- In the background, there are two cars parked on the street, one on the left side and the other on the right side. Additionally, there is a traffic light visible in the scene, indicating that the bus is stopped at an intersection.
  ```
  
  # Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
@@ -82,8 +81,8 @@ Please refer to a forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA
  ## Citation
  
  ```bibtex
- @misc{wang2024llavallama3,
-       title={LLaVA-Llama-3-8B: A reproduction towards LLaVA-v1.5 based on Llama-3-8B LLM backbone},
+ @misc{wang2024llavavideollama3,
+       title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
        author={Wang, Weizhi},
        year={2024}
  }
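
The frame-sampling rule added in the Model Details hunk, `max(30, num_frames//15)`, is easier to see as code. The sketch below illustrates one plausible reading of that expression, treating it as a sampling stride so that a long video contributes at most roughly 15 frames of 576-token CLIP features, staying near the 8k LLaMA-3 context limit; the helper name and the stride interpretation are assumptions, not code from the repository.

```python
# Illustrative sketch (assumption, not repository code): interpret
# max(30, num_frames // 15) as the stride between sampled video frames.

def sample_frame_indices(num_frames: int) -> list[int]:
    """Indices of the frames kept for a video with `num_frames` total frames."""
    stride = max(30, num_frames // 15)  # sampling rate quoted in the model card
    return list(range(0, num_frames, stride))

# Example: a 60 s clip at 30 fps has 1800 frames -> stride 120 -> 15 frames kept.
print(len(sample_frame_indices(1800)))  # 15
```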
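
The updated usage hunk points `url` at an .mp4 clip, but the changed lines still decode the download with `PIL.Image`, which only handles still images; the full README presumably performs frame extraction outside the lines shown in this diff. Purely as a hedged sketch, not the repository's code, the snippet below shows one way the clip could be turned into a batch of frames for the CLIP processor. It assumes OpenCV is available, reuses `image_processor` from the README example and `sample_frame_indices` from the previous sketch, and assumes the raw-file counterpart of the GitHub blob URL (a blob URL returns an HTML page rather than the video bytes).

```python
# Hedged sketch only -- not the repository's code. Assumes: OpenCV (cv2) installed,
# `image_processor` from the README snippet above, `sample_frame_indices` from the
# previous sketch, and the raw.githubusercontent.com counterpart of the blob URL.
import tempfile

import cv2
import requests
from PIL import Image

raw_url = "https://raw.githubusercontent.com/PKU-YuanGroup/Video-LLaVA/main/videollava/serve/examples/sample_demo_1.mp4"  # assumed raw URL

with tempfile.NamedTemporaryFile(suffix=".mp4") as f:
    f.write(requests.get(raw_url).content)
    f.flush()
    cap = cv2.VideoCapture(f.name)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    keep = set(sample_frame_indices(total))
    frames = []
    for idx in range(total):
        ok, frame = cap.read()
        if not ok:
            break
        if idx in keep:
            # OpenCV decodes to BGR; the CLIP processor expects RGB PIL images
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()

# Shape (num_kept_frames, 3, 336, 336), mirroring the per-image call in the README snippet
video_tensor = image_processor.preprocess(frames, return_tensors='pt')['pixel_values'].half().cuda()
```

How the stacked frame tensor is then passed to `model.generate` depends on the LLaVA-Video-Llama-3 code itself, which is not shown in this diff.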