---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
---

# Model Card for LLaVA-Video-LLaMA-3-8B

A reproduced LLaVA LVLM based on the Llama-3-8B LLM backbone. This is not an official implementation.
Please follow my reproduced implementation [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3/) for more details on fine-tuning the LLaVA model with Llama-3 as the foundation LLM.

## Updates
- [6/4/2024] The codebase now supports fine-tuning on video data for video understanding tasks.
- [5/14/2024] The codebase has been upgraded to LLaVA-NeXT (llava-v1.6). It now supports the latest Llama-3, Phi-3, and Mistral-v0.1-7B models.

## Model Details
The model follows the LLaVA-1.5 pre-training and supervised fine-tuning pipeline. No changes to the LLaVA codebase are needed to accommodate Llama-3.

## How to Use

First, install the `llava` package via
```shell
pip install git+https://github.com/Victorwz/LLaVA-Video-Llama-3.git
```
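
To confirm the install, a quick import check (a minimal sketch, assuming the repo installs under the `llava` module name used in the example below):
```python
# sanity check: the package should be importable as `llava`
# (assumption based on the imports in the inference example)
import llava
print(llava.__file__)  # location of the installed package
```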

You can load the model and perform inference as follows:
```python
from io import BytesIO

import requests
import torch
from PIL import Image

from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.model.builder import load_pretrained_model

# load the tokenizer, model, and image processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/llava_llama3_8b_video")
tokenizer, model, image_processor, context_len = load_pretrained_model(
    "weizhiwang/llava_llama3_8b_video", None, model_name, False, False, device=device)

# build the prompt with the Llama-3 conversation template;
# '<image>' marks where the visual features are spliced in
text = '<image>' + '\n' + "Describe the image."
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# -200 is LLaVA's placeholder index for the image token
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

# download and preprocess the example image
url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert('RGB')
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().to(device)

# autoregressively generate the caption
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

# decode only the newly generated tokens
outputs = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)
print(outputs[0])
```
The generated image caption looks like:
```
The image features a blue and orange double-decker bus parked on a street. The bus is stopped at a bus stop, waiting for passengers to board. There are several people standing around the bus, some of them closer to the bus and others further away.

In the background, there are two cars parked on the street, one on the left side and the other on the right side. Additionally, there is a traffic light visible in the scene, indicating that the bus is stopped at an intersection.
```
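
Since this checkpoint is fine-tuned for video understanding, you may also want to run video inference. Below is a minimal sketch that continues from the objects loaded above: it uniformly samples frames with OpenCV and reuses the image pipeline. The `sample_frames` helper, the number of frames, and the one-`<image>`-token-per-frame prompt format are assumptions, not documented interfaces of the repo; please check [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) for the exact video preprocessing.
```python
# Hedged sketch of video inference: sample frames, then reuse the image pipeline.
# `sample_frames` is a hypothetical helper; the multi-'<image>' prompt format
# is an assumption, not a documented interface of the repo.
import cv2  # pip install opencv-python
from PIL import Image

def sample_frames(video_path, num_frames=8):
    # uniformly sample `num_frames` RGB frames from the video
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("demo.mp4")
# preprocess all frames in one batch with the same image processor as above
video_tensor = image_processor.preprocess(frames, return_tensors='pt')['pixel_values'].half().to(device)

# one '<image>' placeholder per sampled frame (assumed prompt format)
text = "<image>\n" * len(frames) + "Describe the video."
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=video_tensor, do_sample=False,
                                max_new_tokens=512, use_cache=True)
print(tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0])
```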

## Fine-Tune LLaVA-Llama-3 on Your Video Instruction Data
Please refer to the forked [LLaVA-Video-Llama-3](https://github.com/Victorwz/LLaVA-Video-Llama-3) repo for fine-tuning data preparation and scripts. The data loading function and the FastChat conversation template were changed to accommodate the different tokenizer.
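
For orientation, LLaVA-style instruction data is a JSON list of conversation records. Below is a hedged sketch of what a single video entry might look like; the field names follow the upstream LLaVA-1.5 convention, and the `video` key and file path are assumptions for the video fork, so consult the repo for the authoritative schema.
```python
# Hedged sketch of one LLaVA-style instruction record. Field names follow
# upstream LLaVA-1.5; the "video" key and path are assumptions for the video fork.
sample_record = {
    "id": "0",
    "video": "videos/sample_0.mp4",  # hypothetical relative path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is happening in this video?"},
        {"from": "gpt", "value": "A person rides a bicycle along the beach."},
    ],
}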

## Citation

```bibtex
@misc{wang2024llavallama3,
      title={LLaVA-Llama-3-8B: A reproduction towards LLaVA-v1.5 based on Llama-3-8B LLM backbone},
      author={Wang, Weizhi},
      year={2024}
}
```