Ivy1997 committed on
Commit 72c3b41
1 Parent(s): 92d9df7

Update README.md

Files changed (1)
  1. README.md +111 -3
README.md CHANGED
@@ -1,3 +1,111 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen2.5-3B-Instruct
+ - google/siglip-so400m-patch14-384
+ tags:
+ - Transformers
+ ---
+
+
+ # Hugging Face Model Card
+
+ Ivy-VL is a lightweight multimodal model with only 3B parameters. It accepts both image and text inputs to generate text outputs. Thanks to its lightweight design, it can be deployed on edge devices such as AI glasses and smartphones, offering low memory usage and high speed while maintaining strong performance on multimodal tasks. The model is built upon the [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) language model, with [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) serving as the vision encoder.
+
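+ If you want the weights available locally first (for example, for offline deployment on the edge devices mentioned above), a minimal sketch using the `huggingface_hub` package (assumed to be installed) is:
+
+ ```python
+ # Minimal sketch: download the Ivy-VL checkpoint to a local directory.
+ # Assumes `pip install huggingface_hub`; the repo id matches the usage
+ # example later in this card.
+ from huggingface_hub import snapshot_download
+
+ local_dir = snapshot_download("AI-Safeguard/Ivy-VL")
+ print(f"Checkpoint files downloaded to: {local_dir}")
+ ```
+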
+ # Model Summary:
+
+ * Developed by: Stanford, CMU, AI Safeguard
+
+ * Model type: Multimodal model (image + text)
+
+ * Language: English and Chinese
+
+ * License: Apache 2.0
+
+ * Architecture: Based on LLaVA-OneVision
+
+
+ # Evaluation:
+
+ ![Evaluation results](evaluation.jpeg)
+
+ Most of the performance data comes from the VLMEvalKit leaderboard or the original papers. We conducted our evaluations using VLMEvalKit. Due to differences in environments and in the LLMs used for evaluation, there may be slight variations in performance.
+
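+ As a rough sketch of how such an evaluation is typically launched with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), the snippet below shells out to its `run.py`. The benchmark names are examples, and the `Ivy-VL` model key is an assumption; check VLMEvalKit's configuration for the identifier actually registered for this model.
+
+ ```python
+ # Hypothetical sketch: drive VLMEvalKit's run.py over a few benchmarks.
+ # Run this from a VLMEvalKit checkout; "Ivy-VL" as a model key is an
+ # assumption and may differ from the name registered upstream.
+ import subprocess
+
+ benchmarks = ["MMBench_DEV_EN", "MMStar", "MMMU_DEV_VAL"]  # example dataset keys
+ for data in benchmarks:
+     subprocess.run(
+         ["python", "run.py", "--data", data, "--model", "Ivy-VL", "--verbose"],
+         check=True,
+     )
+ ```
+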
+ # How to use:
+
+ ```python
+ # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
+ from llava.model.builder import load_pretrained_model
+ from llava.mm_utils import process_images, tokenizer_image_token
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
+ from llava.conversation import conv_templates
+ from PIL import Image
+ import requests
+ import copy
+ import torch
+ import warnings
+
+ warnings.filterwarnings("ignore")
+
+ pretrained = "AI-Safeguard/Ivy-VL"
+
+ model_name = "llava_qwen"
+ device = "cuda"
+ device_map = "auto"
+ tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # add any other kwargs you want to pass via llava_model_args
+
+ model.eval()
+
+ # Load an image from a URL
+ url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # Or load an image from the local filesystem
+ # url = "./local_image.jpg"
+ # image = Image.open(url)
+
+ # Preprocess the image and move it to the GPU in half precision
+ image_tensor = process_images([image], image_processor, model.config)
+ image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
+
+ # Build the prompt with the conversation template
+ conv_template = "qwen_1_5"  # Make sure you use the correct chat template for different models
+ question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
+ conv = copy.deepcopy(conv_templates[conv_template])
+ conv.append_message(conv.roles[0], question)
+ conv.append_message(conv.roles[1], None)
+ prompt_question = conv.get_prompt()
+
+ # Tokenize the prompt, inserting the image token at IMAGE_TOKEN_INDEX
+ input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
+ image_sizes = [image.size]
+
+ # Greedy decoding (do_sample=False)
+ cont = model.generate(
+     input_ids,
+     images=image_tensor,
+     image_sizes=image_sizes,
+     do_sample=False,
+     temperature=0,
+     max_new_tokens=4096,
+ )
+
+ text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
+
+ print(text_outputs)
+ ```
+
+ # Future Plan:
+
+ * We plan to release more model versions in different sizes.
+
+ * We will focus on improving the performance of the video modality.
+
+
+ # Citation:
+
+ ```plaintext
+ @misc{ivy2024ivy-vl,
+   title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
+   url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
+   author={Ivy Zhang, Jenny, Theresa and David Qiu},
+   month={December},
+   year={2024}
+ }
+ ```