---
pipeline_tag: text-generation
---

## OmniLMM 12B
**OmniLMM-12B** is the most capable version of OmniLMM. The model is built on [EVA02-5B](https://github.com/baaivision/EVA/tree/master/EVA-CLIP) and [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), connected by a perceiver resampler layer (a minimal sketch of this connection is given after the feature list below), and trained on multimodal data in a curriculum fashion. The model has three notable features:

- 🔥 **Strong Performance.**

  OmniLMM-12B achieves **leading performance** among models of comparable size, surpassing established LMMs on multiple benchmarks (including MME, MMBench, and SEED-Bench). The model also **supports OCR** and possesses **rich multimodal world knowledge**.

- 🏆 **Trustworthy Behavior.**

  LMMs are known to suffer from hallucination, often generating text that is not factually grounded in the image (e.g., confidently describing objects that do not exist in it). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using our recent [RLHF-V](https://rlhf-v.github.io/) technique) and is **ranked #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench).

- 🕹 **Real-time Multimodal Interaction.**

  We combine OmniLMM-12B and GPT-3.5 into a **real-time multimodal interactive assistant** (a rough sketch of such a loop follows this list). The assistant accepts video streams from the camera and speech streams from the microphone and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.
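
A perceiver resampler compresses the long, variable-length sequence of patch features from the vision encoder into a fixed set of tokens that the language model consumes. The snippet below is a minimal, illustrative PyTorch sketch of this Flamingo-style idea, not OmniLMM-12B's actual module: the dimensions, query count, and single attention layer are placeholder choices.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress variable-length visual features into a fixed set of
    tokens by cross-attending from learned queries (Flamingo-style)."""

    def __init__(self, dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        # Learned query tokens; their count fixes the output length.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats):
        # visual_feats: (batch, n_patches, dim) from the vision encoder
        b = visual_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(queries, visual_feats, visual_feats)
        # -> (batch, num_queries, dim), fed to the language model
        return self.norm(out)
```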
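
The real-time assistant ships separately from this model card, but the control loop it implies is simple. Below is a rough, hypothetical sketch of such a pipeline; `capture_frame`, `transcribe`, and `speak` are assumed placeholder helpers for the camera, speech-to-text, and text-to-speech components (the released demo also involves GPT-3.5, which is omitted here), not APIs shipped with OmniLMM.

```python
import json

from chat import OmniLMMChat, img2base64

# Hypothetical placeholders for real camera / ASR / TTS components.
def capture_frame() -> str: ...    # save the latest camera frame, return its path
def transcribe() -> str: ...       # microphone speech -> text
def speak(text: str) -> None: ...  # text -> speech output

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')

while True:
    question = transcribe()
    msgs = [{"role": "user", "content": question}]
    inputs = {"image": img2base64(capture_frame()),
              "question": json.dumps(msgs)}
    answer = chat_model.process(inputs)  # answer grounded in the current frame
    speak(answer)
```
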
<table>
<thead>
  <tr>
    <th align="left">Model</th>
    <th>Size</th>
    <th>MME</th>
    <th nowrap="nowrap">MMMU val</th>
    <th nowrap="nowrap">MMHal-Bench</th>
    <th nowrap="nowrap">SeedBench-I</th>
    <th nowrap="nowrap">LLaVA Bench W</th>
    <th>MathVista</th>
    <th nowrap="nowrap">MMB dev (en)</th>
  </tr>
</thead>
<tbody align="center">
  <tr>
    <td align="left">GPT-4V †</td>
    <td>-</td>
    <td>1409</td>
    <td>56.8</td>
    <td>3.53 / 70.8</td>
    <td>71.6</td>
    <td>93.1</td>
    <td>47.8</td>
    <td>75.1</td>
  </tr>
  <tr>
    <td nowrap="nowrap" align="left">Qwen-VL-Plus †</td>
    <td>-</td>
    <td>1681</td>
    <td>45.2</td>
    <td>-</td>
    <td>65.7</td>
    <td>73.7</td>
    <td>36.0</td>
    <td>66.2</td>
  </tr>
  <tr>
    <td align="left">Yi-VL 6B</td>
    <td align="right">6.7B</td>
    <td>-</td>
    <td>39.1</td>
    <td>-</td>
    <td>66.1</td>
    <td>39.9</td>
    <td>28.0</td>
    <td>68.2</td>
  </tr>
  <tr>
    <td align="left">CogVLM</td>
    <td align="right">17.4B</td>
    <td>1438</td>
    <td>32.1</td>
    <td>2.68 / 52.1</td>
    <td>68.8</td>
    <td>73.9</td>
    <td>34.7</td>
    <td>63.7</td>
  </tr>
  <tr>
    <td nowrap="nowrap" align="left">Qwen-VL-Chat</td>
    <td align="right">9.6B</td>
    <td>1488</td>
    <td>35.9</td>
    <td>2.93 / 59.4</td>
    <td>64.8</td>
    <td>67.7</td>
    <td>33.8</td>
    <td>60.6</td>
  </tr>
  <tr>
    <td align="left">LLaVA 1.5</td>
    <td align="right">13.6B</td>
    <td>1531</td>
    <td>36.4</td>
    <td>2.71 / 51.0</td>
    <td>68.1</td>
    <td>64.6</td>
    <td>26.4</td>
    <td>68.2</td>
  </tr>
  <tr>
    <td nowrap="nowrap" align="left"><b>OmniLMM-12B</b></td>
    <td align="right">11.6B</td>
    <td>1637</td>
    <td>40.7</td>
    <td>3.45 / 68.8</td>
    <td>71.1</td>
    <td>72.0</td>
    <td>34.9</td>
    <td>71.6</td>
  </tr>
</tbody>
</table>
<small>†: closed-source models</small>

## Demo
Click here to try out the demo of [OmniLMM-12B](http://120.92.209.146:8081).

## Install

1. Clone this repository and navigate to the source folder

```bash
git clone https://github.com/OpenBMB/OmniLMM.git
cd OmniLMM
```

2. Create a conda environment

```shell
conda create -n OmniLMM python=3.10 -y
conda activate OmniLMM
```

3. Install dependencies

```shell
pip install -r requirements.txt
```

## Inference

### Multi-turn Conversation
Please refer to the following code to run `OmniLMM`.

<div align="center">
<img src="assets/COCO_test2015_000000262144.jpg" width="660px">
</div>

##### OmniLMM-12B
```python
import json

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')

# Encode the input image as a base64 string
im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')

# First round chat
msgs = [{"role": "user", "content": "What are the people doing?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)

# Second round chat
# pass the history context of the multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Describe the image"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)
```

We can obtain the following results:
```
"The people in the image are playing baseball. One person is pitching a ball, another one is swinging a bat to hit it, and there's also an umpire present who appears to be watching the game closely."

"The image depicts a baseball game in progress. A pitcher is throwing the ball, while another player is swinging his bat to hit it. An umpire can be seen observing the play closely."
```
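
If you need to construct the base64 image payload without the repository's helper, the standard library is enough. The snippet below is a sketch of `img2base64`'s presumed behavior (raw file bytes, base64-encoded to a UTF-8 string); the actual helper in the `chat` module may differ in details:

```python
import base64

def image_to_base64(path: str) -> str:
    # Read the raw image bytes and return them as a base64 string,
    # the format expected in the "image" field of the inputs dict.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```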