---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
---

# Model Card for LLaVA-Video-LLaMA-3-8B

<!-- Provide a quick summary of what the model is/does. -->

A reproduced LLaVA LVLM built on the Llama-3-8B LLM backbone. This is not an official implementation.
Please follow my reproduced implementation [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3/) for more details on fine-tuning a LLaVA model with Llama-3 as the foundation LLM.

## Updates
- [6/4/2024] The codebase supports fine-tuning on video data for video understanding tasks.
- [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). It now supports the latest llama-3, phi-3, and mistral-v0.1-7b models.

## Model Details
The model follows the LLaVA-1.5 pre-training and supervised fine-tuning pipeline. You do not need to change the LLaVA codebase to accommodate Llama-3.

## How to Use

First, install llava via
```
pip install git+https://github.com/Victorwz/LLaVA-Video-Llama-3.git
```

You can load the model and perform inference as follows:
```python
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
from PIL import Image
import requests
import torch
from io import BytesIO

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/llava_llama3_8b_video")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/llava_llama3_8b_video", None, model_name, False, False, device=device)

# prepare inputs for the model
text = '<image>' + '\n' + "Describe the image."
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# -200 is the image-token placeholder index used by the LLaVA codebase
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).cuda()

# prepare image input
url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert('RGB')
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().cuda()

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

# decode only the newly generated tokens
outputs = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)
print(outputs[0])
```
The generated image caption looks like:
```
The image features a blue and orange double-decker bus parked on a street. The bus is stopped at a bus stop, waiting for passengers to board. There are several people standing around the bus, some of them closer to the bus and others further away.

In the background, there are two cars parked on the street, one on the left side and the other on the right side. Additionally, there is a traffic light visible in the scene, indicating that the bus is stopped at an intersection.
```

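Since this checkpoint targets video understanding, you may also want to run inference on video inputs. The snippet below is only a minimal sketch, not the official interface: it assumes frames sampled with OpenCV can be preprocessed by the same `image_processor` and stacked into one tensor, and the file name `example.mp4` and the choice of 8 frames are placeholders. Please check the [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3/) repo for the exact video input format expected by `model.generate`.

```python
# Hypothetical sketch: uniformly sample frames from a local video and preprocess
# them with the same image_processor used for the single-image example above.
import cv2
import numpy as np
from PIL import Image

def sample_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes to BGR; convert to RGB before handing frames to the processor
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("example.mp4", num_frames=8)  # placeholder path
# shape [num_frames, 3, H, W]; how the stacked frames are passed to generate()
# may differ in the fork, so treat this only as the preprocessing step
frame_tensor = image_processor.preprocess(frames, return_tensors='pt')['pixel_values'].half().cuda()
```
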
## Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
Please refer to the forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) repo for fine-tuning data preparation and scripts. The data loading function and the fastchat conversation template were changed to accommodate the different tokenizer.

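The repo's data preparation instructions define the exact schema. As a rough illustration only, LLaVA-style instruction data is typically a JSON list of records, each with a media path and a multi-turn `conversations` field. The field names (`video`, `conversations`, `from`, `value`), the placeholder token, and the paths below are assumptions for illustration, not the fork's confirmed format.

```json
[
  {
    "id": "sample-0001",
    "video": "videos/sample-0001.mp4",
    "conversations": [
      {"from": "human", "value": "<image>\nWhat happens in this video?"},
      {"from": "gpt", "value": "A person assembles a wooden chair on a patio."}
    ]
  }
]
```
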
## Citation

```bibtex
@misc{wang2024llavallama3,
  title={LLaVA-Llama-3-8B: A reproduction towards LLaVA-v1.5 based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}
```