Update README.md
README.md CHANGED
@@ -18,7 +18,7 @@ pipeline_tag: text-generation

We combine OmniLMM-12B and GPT-3.5 into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone, and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.

-
+## Evaluation
<table>
<thead>
<tr>
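The card does not ship code for this assistant, so the following is only an illustrative sketch of a single turn of such a pipeline. It assumes the `chat` helpers (`OmniLMMChat`, `img2base64`) from the OpenBMB/OmniLMM GitHub repository; the camera capture, speech recognition, GPT-3.5 dialogue step, and text-to-speech pieces are only indicated in comments because their implementations are not part of this card.

```python
# Illustrative sketch only -- not the authors' implementation of the real-time assistant.
# Assumes chat.py from https://github.com/OpenBMB/OmniLMM is importable.
import json

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')

def assistant_turn(frame_path: str, spoken_question: str) -> str:
    """Answer one spoken question grounded in the current camera frame."""
    im_64 = img2base64(frame_path)                         # frame grabbed from the camera stream
    msgs = [{"role": "user", "content": spoken_question}]  # text from a speech-recognition front end
    answer = chat_model.process({"image": im_64, "question": json.dumps(msgs)})
    # In the demo described above, GPT-3.5 would rewrite `answer` as a conversational
    # reply, which a text-to-speech component then plays back to the user.
    return answer
```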
@@ -118,68 +118,6 @@ pipeline_tag: text-generation
## Demo
Click here to try out the demo of [OmniLMM-12B](http://120.92.209.146:8081).

-## Install
-
-1. Clone this repository and navigate to the source folder
-
-```bash
-git clone https://github.com/OpenBMB/OmniLMM.git
-cd OmniLMM
-```
-
-2. Create a conda environment
-
-```Shell
-conda create -n OmniLMM python=3.10 -y
-conda activate OmniLMM
-```
-
-3. Install dependencies
-
-```shell
-pip install -r requirements.txt
-```
-
-## Inference
-
-### Multi-turn Conversation
-Please refer to the following code to run `OmniLMM`.
-
-<div align="center">
-<img src="assets/COCO_test2015_000000262144.jpg" width="660px">
-</div>
-
-##### OmniLMM-12B
-```python
-import json
-
-from chat import OmniLMMChat, img2base64
-
-chat_model = OmniLMMChat('openbmb/OmniLMM-12B')
-
-im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')
-
-# First round chat
-msgs = [{"role": "user", "content": "What are the people doing?"}]
-
-inputs = {"image": im_64, "question": json.dumps(msgs)}
-answer = chat_model.process(inputs)
-print(answer)
-
-# Second round chat
-# Pass the history of the multi-turn conversation as context
-msgs.append({"role": "assistant", "content": answer})
-msgs.append({"role": "user", "content": "Describe the image"})
-
-inputs = {"image": im_64, "question": json.dumps(msgs)}
-answer = chat_model.process(inputs)
-print(answer)
-```
-
-We can obtain the following results:
-```
-"The people in the image are playing baseball. One person is pitching a ball, another one is swinging a bat to hit it, and there's also an umpire present who appears to be watching the game closely."
-
-"The image depicts a baseball game in progress. A pitcher is throwing the ball, while another player is swinging his bat to hit it. An umpire can be seen observing the play closely."
-```
+## Usage
+See [GitHub](https://github.com/OpenBMB/OmniLMM) for more details about usage.
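For readers of the model card who do not follow the link, here is a minimal single-turn sketch that reuses the `chat` helper from that repository exactly as in the section removed above; it assumes the repository's `chat.py` and the bundled example image are available locally.

```python
import json

from chat import OmniLMMChat, img2base64  # helpers from the OpenBMB/OmniLMM repository

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')
im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')

# Single-turn question about the image; see the removed section above for the multi-turn variant.
msgs = [{"role": "user", "content": "What are the people doing?"}]
answer = chat_model.process({"image": im_64, "question": json.dumps(msgs)})
print(answer)
```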