---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
---

# Model Card for LLaVA-Video-LLaMA-3-8B

Please follow my GitHub repo [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3/) for more details on fine-tuning the LLaVA model with Llama-3 as the foundation LLM.

## Updates
- [6/4/2024] The codebase supports video-data fine-tuning for video understanding tasks.
- [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). It now supports the latest llama-3, phi-3, and mistral-v0.1-7b models.

## Model Details

The model follows the LLaVA-1.5 pre-training and supervised fine-tuning pipeline. You do not need to change the LLaVA codebase to accommodate Llama-3.

## How to Use

Please first install llava via

```
pip install git+https://github.com/Victorwz/LLaVA-Video-Llama-3.git
```

You can load the model and perform inference as follows:

```python
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
from PIL import Image
import requests
import torch
from io import BytesIO

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/llava_llama3_8b_video")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/llava_llama3_8b_video", None, model_name, False, False, device=device)

# prepare inputs for the model; the <image> placeholder is replaced below by the image token id (-200)
text = '<image>' + '\n' + "Describe the image."
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

# prepare image input
url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert('RGB')
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().to(device)

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)
print(outputs[0])
```

The image caption result looks like:

```
The image features a blue and orange double-decker bus parked on a street. The bus is stopped at a bus stop, waiting for passengers to board. There are several people standing around the bus, some of them closer to the bus and others further away. In the background, there are two cars parked on the street, one on the left side and the other on the right side. Additionally, there is a traffic light visible in the scene, indicating that the bus is stopped at an intersection.
```

## Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data

Please refer to the forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) repo for fine-tuning data preparation and scripts. The data loading function and the FastChat conversation template are changed because Llama-3 uses a different tokenizer.

## Citation

```bibtex
@misc{wang2024llavallama3,
  title={LLaVA-Llama-3-8B: A reproduction towards LLaVA-v1.5 based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}
```
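
## Video Inference (Sketch)

The checkpoint is fine-tuned on video data, but the snippet in "How to Use" covers only a single image. The sketch below shows one plausible way to run video inference: uniformly sample frames with OpenCV, preprocess them with the same `image_processor`, and pass the stacked frame tensor through the `images` argument. The frame count, the video path, and the use of a single `<image>` placeholder for the whole clip are my assumptions, not the repo's documented API; please consult the video fine-tuning scripts in the repo for the exact pipeline.

```python
# Hedged sketch of video inference; reuses tokenizer, model, and image_processor
# loaded in the image example above. Assumptions are noted inline.
import cv2
import torch
from PIL import Image
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file with OpenCV."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# "example_video.mp4" and num_frames=8 are placeholders, not values from the repo.
frames = sample_frames("example_video.mp4", num_frames=8)
frame_tensor = image_processor.preprocess(frames, return_tensors='pt')['pixel_values'].half().cuda()

# Assumption: a single <image> placeholder stands in for the whole clip; the fork
# may instead expect one placeholder per frame -- check its video scripts.
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], "<image>\nDescribe what happens in the video.")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, -200, return_tensors='pt').unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=frame_tensor,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

print(tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0])
```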