Update README.md
README.md CHANGED
@@ -18,7 +18,7 @@ pipeline_tag: text-generation

We combine OmniLMM-12B and GPT-3.5 into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone, and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.

-
+## Evaluation
<table>
<thead>
<tr>
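The card does not ship code for this assistant, so the following is only an illustrative sketch of a single turn of such a pipeline. It assumes the `chat` helpers (`OmniLMMChat`, `img2base64`) from the OpenBMB/OmniLMM GitHub repository; the camera capture, speech recognition, GPT-3.5 dialogue step, and text-to-speech pieces are only indicated in comments because their implementations are not part of this card.

```python
# Illustrative sketch only -- not the authors' implementation of the real-time assistant.
# Assumes chat.py from https://github.com/OpenBMB/OmniLMM is importable.
import json

from chat import OmniLMMChat, img2base64

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')

def assistant_turn(frame_path: str, spoken_question: str) -> str:
    """Answer one spoken question grounded in the current camera frame."""
    im_64 = img2base64(frame_path)                         # frame grabbed from the camera stream
    msgs = [{"role": "user", "content": spoken_question}]  # text from a speech-recognition front end
    answer = chat_model.process({"image": im_64, "question": json.dumps(msgs)})
    # In the demo described above, GPT-3.5 would rewrite `answer` as a conversational
    # reply, which a text-to-speech component then plays back to the user.
    return answer
```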
@@ -118,68 +118,6 @@ pipeline_tag: text-generation
## Demo
Click here to try out the demo of [OmniLMM-12B](http://120.92.209.146:8081).

-## Install
-
-1. Clone this repository and navigate to the source folder
-
-```bash
-git clone https://github.com/OpenBMB/OmniLMM.git
-cd OmniLMM
-```
-
-2. Create a conda environment
-
-```Shell
-conda create -n OmniLMM python=3.10 -y
-conda activate OmniLMM
-```
-
-3. Install dependencies
-
-```shell
-pip install -r requirements.txt
-```
-
-## Inference
-
-### Multi-turn Conversation
-Please refer to the following code to run `OmniLMM`.
-
-<div align="center">
-<img src="assets/COCO_test2015_000000262144.jpg" width="660px">
-</div>
-
-##### OmniLMM-12B
-```python
-import json
-
-from chat import OmniLMMChat, img2base64
-
-chat_model = OmniLMMChat('openbmb/OmniLMM-12B')
-
-im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')
-
-# First round chat
-msgs = [{"role": "user", "content": "What are the people doing?"}]
-
-inputs = {"image": im_64, "question": json.dumps(msgs)}
-answer = chat_model.process(inputs)
-print(answer)
-
-# Second round chat
-# Pass the history of the multi-turn conversation as context
-msgs.append({"role": "assistant", "content": answer})
-msgs.append({"role": "user", "content": "Describe the image"})
-
-inputs = {"image": im_64, "question": json.dumps(msgs)}
-answer = chat_model.process(inputs)
-print(answer)
-```
-
-We can obtain the following results:
-```
-"The people in the image are playing baseball. One person is pitching a ball, another one is swinging a bat to hit it, and there's also an umpire present who appears to be watching the game closely."
-
-"The image depicts a baseball game in progress. A pitcher is throwing the ball, while another player is swinging his bat to hit it. An umpire can be seen observing the play closely."
-```
+## Usage
+See [GitHub](https://github.com/OpenBMB/OmniLMM) for more details about usage.
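For readers of the model card who do not follow the link, here is a minimal single-turn sketch that reuses the `chat` helper from that repository exactly as in the section removed above; it assumes the repository's `chat.py` and the bundled example image are available locally.

```python
import json

from chat import OmniLMMChat, img2base64  # helpers from the OpenBMB/OmniLMM repository

chat_model = OmniLMMChat('openbmb/OmniLMM-12B')
im_64 = img2base64('./data/COCO_test2015_000000262144.jpg')

# Single-turn question about the image; see the removed section above for the multi-turn variant.
msgs = [{"role": "user", "content": "What are the people doing?"}]
answer = chat_model.process({"image": im_64, "question": json.dumps(msgs)})
print(answer)
```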