visheratin committed on
Commit
7190045
·
1 Parent(s): 56aa8a6

Create README.md

Files changed (1)
  1. README.md +113 -0
README.md ADDED
@@ -0,0 +1,113 @@
+ ---
+ datasets:
+ - liuhaotian/LLaVA-Pretrain
+ - liuhaotian/LLaVA-Instruct-150K
+ language:
+ - en
+ tags:
+ - llava
+ - phi
+ ---
+
+ # LLaVA-3b Model Card
+
+ ## Model details
+
+ LLaVA-3b is a model fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) in a LLaVA fashion, using the vision tower from
+ [SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384). It differs from the original LLaVA architecture in two ways:
+
+ 1. Multiple image tokens. The multimodal projector generates embeddings of shape [5, 2560] instead of [1, 2560] for images. The idea is that using more tokens
+ lets more information from the image reach the language model.
+ 2. The model uses the output of the last layer of the vision encoder instead of an intermediate one.
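
To make the [5, 2560] shape concrete, here is a minimal sketch of how a single flat projector output could be split into five token embeddings of width 2560. Only the two sizes come from this model card. The flat-vector layout and the `split_into_tokens` helper are assumptions for illustration, and the actual projector in `modeling_llava.py` may work differently.

```python
NUM_IMAGE_TOKENS = 5  # number of image tokens, from the model card
HIDDEN_SIZE = 2560    # language model embedding width, from the model card


def split_into_tokens(flat, num_tokens=NUM_IMAGE_TOKENS, hidden=HIDDEN_SIZE):
    """Split a flat vector into `num_tokens` chunks of length `hidden`.

    Hypothetical helper: it assumes the projector emits one vector of
    length num_tokens * hidden, which is not confirmed by the model card.
    """
    if len(flat) != num_tokens * hidden:
        raise ValueError("expected a vector of length num_tokens * hidden")
    return [flat[i * hidden:(i + 1) * hidden] for i in range(num_tokens)]


# Dummy projector output standing in for real image features.
flat_output = [0.0] * (NUM_IMAGE_TOKENS * HIDDEN_SIZE)
image_tokens = split_into_tokens(flat_output)
print(len(image_tokens), len(image_tokens[0]))
```

Each of the five chunks would then occupy one position in the language model's input sequence, where a single-token design would occupy only one.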
+
+ Like Dolphin 2.6 Phi, LLaVA-3b uses the ChatML prompt format:
+
+ ```
+ <|im_start|>system
+ You are Dolphin, a helpful AI assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
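
The template above can be filled in programmatically. The following is a small sketch, not part of the model's code: `build_chatml_prompt` is a hypothetical helper name, and the default system message is simply the one shown in the template.

```python
def build_chatml_prompt(
    user_message: str,
    system_message: str = "You are Dolphin, a helpful AI assistant.",
) -> str:
    """Format one system turn and one user turn in ChatML.

    The string ends with an open assistant turn so that the model
    continues from there.
    """
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# The <image> placeholder marks where the image tokens go in the user turn.
prompt = build_chatml_prompt("<image>\nDescribe the image.")
print(prompt)
```

Building the prompt in one place keeps the special tokens and newlines consistent across calls, which is easy to get wrong when the string is typed by hand.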
+
+ ## How to use
+
+ **Install dependencies**
+
+ ```
+ !pip install -q open_clip_torch timm einops
+ ```
+
+ **Download modeling files**
+
+ ```
+ from huggingface_hub import hf_hub_download
+
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
+ ```
+
+ **Create a model**
+
+ ```
+ from modeling_llava import LlavaForConditionalGeneration
+ import torch
+
+ model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float16)
+ model = model.to("cuda")
+ ```
+
+ **Create processors**
+
+ ```
+ from transformers import AutoTokenizer
+ from processing_llava import LlavaProcessor, OpenCLIPImageProcessor
+
+ tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
+ image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
+ processor = LlavaProcessor(image_processor, tokenizer)
+ ```
+
+ **Set image and text**
+
+ ```
+ from PIL import Image
+ import requests
+
+ image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
+ raw_image = Image.open(requests.get(image_file, stream=True).raw)
+
+ prompt = """<|im_start|>system
+ A chat between a curious human and an artificial intelligence assistant.
+ The assistant gives helpful, detailed, and polite answers to the human's questions.
+ The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
+ <|im_start|>user
+ <image>
+ Describe the image.<|im_end|>
+ <|im_start|>assistant
+ """
+ ```
+
+ **Process inputs**
+
+ ```
+ inputs = processor(prompt, raw_image, model, return_tensors='pt')
+
+ inputs['input_ids'] = inputs['input_ids'].to(model.device)
+ inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
+ ```
+
+ **Generate the output**
+
+ ```
+ output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.5, temperature=1.2, eos_token_id=tokenizer.eos_token_id)
+ ```
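
`model.generate` returns token IDs, so the result still has to be decoded, for example with `tokenizer.decode(output[0])`. The decoded string may or may not include the prompt, depending on how the custom model code handles generation. The sketch below is one possible way to pull out only the assistant's reply. `extract_reply` is a hypothetical helper, and the sample string is made up so that the snippet runs without the model.

```python
ASSISTANT_TAG = "<|im_start|>assistant"
END_TAG = "<|im_end|>"


def extract_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a ChatML string."""
    # Keep what follows the last assistant header. If there is no header,
    # rpartition returns the whole string as the tail.
    reply = decoded.rpartition(ASSISTANT_TAG)[2]
    # Cut at the end-of-turn marker, if the model emitted one.
    reply = reply.split(END_TAG, 1)[0]
    return reply.strip()


# Stand-in for tokenizer.decode(output[0]); the text is invented for the demo.
decoded = (
    "<|im_start|>user\nDescribe the image.<|im_end|>\n"
    "<|im_start|>assistant\nA lake at dawn.<|im_end|>"
)
print(extract_reply(decoded))
```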
+
+ ## License
+ This model is based on Phi-2 and is governed by Microsoft's microsoft-research-license, which prohibits commercial use.
+
+ **Where to send questions or comments about the model:**
+ https://twitter.com/visheratin