---
license: apache-2.0
---

# Megrez-3B-Omni: The First Open-Source On-device LLM with Full Modality Understanding

🔗 GitHub   |   🏠 Demo   |   📖 WeChat Official   |   💬 WeChat Groups  

中文 | English

## Introduction

Megrez-3B-Omni is an on-device multimodal understanding model developed by Infinigence AI. It extends the Megrez-3B-Instruct model with support for image, text, and audio inputs, and achieves state-of-the-art accuracy in all three modalities:

- Image Understanding: Using SigLip-400M to construct image tokens, Megrez-3B-Omni outperforms larger models such as LLaVA-NeXT-Yi-34B. It is among the best image-understanding models on multiple mainstream benchmarks, including MME, MMMU, and OCRBench, and performs well on tasks such as scene understanding and OCR.
- Language Understanding: Megrez-3B-Omni retains text-understanding capability without significant trade-offs: compared with its single-modal counterpart (Megrez-3B-Instruct), accuracy varies by less than 2%, and it maintains state-of-the-art performance on benchmarks such as C-EVAL, MMLU/MMLU-Pro, and AlignBench, outperforming previous-generation models with 14B parameters.
- Speech Understanding: Equipped with the encoder head of Qwen2-Audio/whisper-large-v3, the model supports both Chinese and English speech input, multi-turn conversations, and voice questions about input images. It responds directly to voice commands with text and achieves leading results across multiple benchmarks.

## Model Info

|                     | Language Module   | Vision Module      | Audio Module                     |
|---------------------|-------------------|--------------------|----------------------------------|
| Architecture        | Llama-2 with GQA  | SigLip-SO400M      | Whisper-large-v3 (encoder-only)  |
| # Params (Backbone) | 2.29B             | 0.42B              | 0.64B                            |
| Connector           | -                 | Cross Attention    | Linear                           |
| # Params (Others)   | Emb: 0.31B<br>Softmax: 0.31B | Connector: 0.036B | Connector: 0.003B      |
| # Params (Total)    | 4B (all modules)  |                    |                                  |
| # Vocab Size        | 122880            | 64 tokens/slice    | -                                |
| Context length      | 4K tokens         |                    |                                  |
| Supported languages | Chinese & English |                    |                                  |
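As a quick sanity check, the per-module counts in the table sum to the stated ~4B total; a minimal sketch:

```python
# Sum the per-module parameter counts from the table above (in billions).
components = {
    "language backbone": 2.29,
    "vision backbone": 0.42,
    "audio backbone": 0.64,
    "embedding": 0.31,
    "softmax": 0.31,
    "vision connector": 0.036,
    "audio connector": 0.003,
}
total = sum(components.values())
print(f"total = {total:.2f}B")  # ~4.01B, matching the ~4B total in the table
```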

## Image Understanding

- The image above compares Megrez-3B-Omni with other open-source models on mainstream multimodal image benchmarks.
- The image below shows the performance of Megrez-3B-Omni on the OpenCompass test set. Image reference: InternVL 2.5 Blog Post.


| Model | Base model | Release date | OpenCompass | MME | MMMU (val) | OCRBench | MathVista | RealWorldQA | MMVet | HallusionBench | MMB test (en) | MMB test (zh) | TextVQA (val) | AI2D (test) | MMStar | DocVQA (test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Megrez-3B-Omni | Megrez-3B | 2024.12.16 | 66.2 | 2315 | 51.89 | 82.8 | 62 | 71.89 | 60 | 50.12 | 80.8 | 82.3 | 80.3 | 82.05 | 60.46 | 91.62 |
| Qwen2-VL-2B-Instruct | Qwen2-1.5B | 2024.08.28 | 57.2 | 1872 | 41.1 | 79.4 | 43 | 62.9 | 49.5 | 41.7 | 74.9 | 73.5 | 79.7 | 74.7 | 48 | 90.1 |
| InternVL2.5-2B | Internlm2.5-1.8B-chat | 2024.12.06 | 59.9 | 2138 | 43.6 | 80.4 | 51.3 | 60.1 | 60.8 | 42.6 | 74.7 | 71.9 | 74.3 | 74.9 | 53.7 | 88.7 |
| BlueLM-V-3B | - | 2024.11.29 | 66.1 | - | 45.1 | 82.9 | 60.8 | 66.7 | 61.8 | 48 | 83 | 80.5 | 78.4 | 85.3 | 62.3 | 87.8 |
| InternVL2.5-4B | Qwen2.5-3B-Instruct | 2024.12.06 | 65.1 | 2337 | 52.3 | 82.8 | 60.5 | 64.3 | 60.6 | 46.3 | 81.1 | 79.3 | 76.8 | 81.4 | 58.3 | 91.6 |
| Baichuan-Omni | Unknown-7B | 2024.10.11 | - | 2186 | 47.3 | 70.0 | 51.9 | 62.6 | 65.4 | 47.8 | 76.2 | 74.9 | 74.3 | - | - | - |
| MiniCPM-V-2.6 | Qwen2-7B | 2024.08.06 | 65.2 | 2348 | 49.8 | 85.2 | 60.6 | 69.7 | 60 | 48.1 | 81.2 | 79 | 80.1 | 82.1 | 57.26 | 90.8 |
| Qwen2-VL-7B-Instruct | Qwen2-7B | 2024.08.28 | 67 | 2326 | 54.1 | 84.5 | 58.2 | 70.1 | 62 | 50.6 | 83 | 80.5 | 84.3 | 83 | 60.7 | 94.5 |
| MiniCPM-Llama3-V-2.5 | Llama3-Instruct 8B | 2024.05.20 | 58.8 | 2024 | 45.8 | 72.5 | 54.3 | 63.5 | 52.8 | 42.4 | 77.2 | 74.2 | 76.6 | 78.4 | - | 84.8 |
| VITA | Mixtral 8x7B | 2024.08.12 | - | 2097 | 47.3 | 67.8 | 44.9 | 59 | 41.6 | 39.7 | 74.7 | 71.4 | 71.8 | - | - | - |
| GLM-4V-9B | GLM-4-9B | 2024.06.04 | 59.1 | 2018 | 46.9 | 77.6 | 51.1 | - | 58 | 46.6 | 81.1 | 79.4 | - | 81.1 | 58.7 | - |
| LLaVA-NeXT-Yi-34B | Yi-34B | 2024.01.18 | 55 | 2006 | 48.8 | 57.4 | 40.4 | 66 | 50.7 | 34.8 | 81.1 | 79 | 69.3 | 78.9 | 51.6 | - |
| Qwen2-VL-72B-Instruct | Qwen2-72B | 2024.08.28 | 74.8 | 2482 | 64.5 | 87.7 | 70.5 | 77.8 | 74 | 58.1 | 86.5 | 86.6 | 85.5 | 88.1 | 68.3 | 96.5 |

## Text Understanding

Benchmarks are grouped as Chat & Instruction (MT-Bench, AlignBench (ZH), IFEval), Zh & En tasks (C-EVAL, CMMLU, MMLU, MMLU-Pro), Code (HumanEval, MBPP), and Math (GSM8K, MATH).

| Model | Instruction-tuned | Release date | Non-emb params (B) | MT-Bench | AlignBench (ZH) | IFEval | C-EVAL (ZH) | CMMLU (ZH) | MMLU | MMLU-Pro | HumanEval | MBPP | GSM8K | MATH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Megrez-3B-Omni | Y | 2024.12.16 | 2.3 | 8.4 | 6.94 | 66.5 | 84.0 | 75.3 | 73.3 | 45.2 | 72.6 | 60.6 | 63.8 | 27.3 |
| Megrez-3B-Instruct | Y | 2024.12.16 | 2.3 | 8.64 | 7.06 | 68.6 | 84.8 | 74.7 | 72.8 | 46.1 | 78.7 | 71.0 | 65.5 | 28.3 |
| Baichuan-Omni | Y | 2024.10.11 | 7.0 | - | - | - | 68.9 | 72.2 | 65.3 | - | - | - | - | - |
| VITA | Y | 2024.08.12 | 12.9 | - | - | - | 56.7 | 46.6 | 71.0 | - | - | - | 75.7 | - |
| Qwen1.5-7B | | 2024.02.04 | 6.5 | - | - | - | 74.1 | 73.1 | 61.0 | 29.9 | 36.0 | 51.6 | 62.5 | 20.3 |
| Qwen1.5-7B-Chat | Y | 2024.02.04 | 6.5 | 7.60 | 6.20 | - | 67.3 | - | 59.5 | 29.1 | 46.3 | 48.9 | 60.3 | 23.2 |
| Qwen1.5-14B | | 2024.02.04 | 12.6 | - | - | - | 78.7 | 77.6 | 67.6 | - | 37.8 | 44.0 | 70.1 | 29.2 |
| Qwen1.5-14B-Chat | Y | 2024.02.04 | 12.6 | 7.9 | - | - | - | - | - | - | - | - | - | - |
| Qwen2-7B | | 2024.06.07 | 6.5 | - | - | - | 83.2 | 83.9 | 70.3 | 40.0 | 51.2 | 65.9 | 79.9 | 44.2 |
| Qwen2-7B-Instruct | Y | 2024.06.07 | 6.5 | 8.41 | 7.21 | 51.4 | 80.9 | 77.2 | 70.5 | 44.1 | 79.9 | 67.2 | 85.7 | 52.9 |
| Qwen2.5-3B-Instruct | Y | 2024.09.19 | 2.8 | - | - | - | - | - | - | 43.7 | 74.4 | 72.7 | 86.7 | 65.9 |
| Qwen2.5-7B | | 2024.09.19 | 6.5 | - | - | - | - | - | 74.2 | 45.0 | 57.9 | 74.9 | 85.4 | 49.8 |
| Qwen2.5-7B-Instruct | Y | 2024.09.19 | 6.5 | 8.75 | - | 74.9 | - | - | - | 56.3 | 84.8 | 79.2 | 91.6 | 75.5 |
| Llama-3.1-8B | | 2024.07.23 | 7.0 | 8.3 | 5.7 | 71.5 | 55.2 | 55.8 | 66.7 | 37.1 | - | - | 84.5 | 51.9 |
| Llama-3.2-3B | | 2024.09.25 | 2.8 | - | - | 77.4 | - | - | 63.4 | - | - | - | 77.7 | 48.0 |
| Phi-3.5-mini-instruct | Y | 2024.08.23 | 3.6 | 8.6 | 5.7 | 49.4 | 46.1 | 46.9 | 69.0 | 47.4 | 62.8 | 69.6 | 86.2 | 48.5 |
| MiniCPM3-4B | Y | 2024.09.05 | 3.9 | 8.41 | 6.74 | 68.4 | 73.6 | 73.3 | 67.2 | - | 74.4 | 72.5 | 81.1 | 46.6 |
| Yi-1.5-6B-Chat | Y | 2024.05.11 | 5.5 | 7.50 | 6.20 | - | 74.2 | 74.7 | 61.0 | - | 64.0 | 70.9 | 78.9 | 40.5 |
| GLM-4-9B-Chat | Y | 2024.06.04 | 8.2 | 8.35 | 7.01 | 64.5 | 75.6 | 71.5 | 72.4 | - | 71.8 | - | 79.6 | 50.6 |
| Baichuan2-13B-Base | | 2023.09.06 | 12.6 | - | 5.25 | - | 58.1 | 62.0 | 59.2 | - | 17.1 | 30.2 | 52.8 | 10.1 |
- The metrics for the Qwen2-1.5B model differ between the original Qwen2 paper and the Qwen2.5 report; the figures from the original paper are used here.

## Audio Understanding

Results are ASR error rates; lower is better.

| Model | Base model | Release date | Fleurs (test-zh) | WenetSpeech (test_net) | WenetSpeech (test_meeting) |
|---|---|---|---|---|---|
| Megrez-3B-Omni | Megrez-3B-Instruct | 2024.12.16 | 10.8 | - | 16.4 |
| Whisper-large-v3 | - | 2023.11.06 | 12.4 | 17.5 | 30.8 |
| Qwen2-Audio-7B | Qwen2-7B | 2024.08.09 | 9 | 11 | 10.7 |
| Baichuan-Omni | Unknown-7B | 2024.10.11 | 7 | 6.9 | 8.4 |
| VITA | Mixtral 8x7B | 2024.08.12 | - | -/12.2 (CER) | -/16.5 (CER) |

## Inference Speed

| Model | Image tokens | Prefill (tokens/s) | Decode (tokens/s) |
|---|---|---|---|
| Megrez-3B-Omni | 448 | 6312.66 | 1294.9 |
| Qwen2-VL-2B | 1378 | 7349.39 | 685.66 |
| MiniCPM-V-2_6 | 448 | 2167.09 | 452.51 |

Setup:

- The tests ran on an NVIDIA H100 GPU with vLLM. Each test uses 128 text tokens and one 720×1480 image as input and produces 128 output tokens, with `num_seqs` fixed at 8.
- Under this setup, the decode speed of Qwen2-VL-2B is lower than that of Megrez-3B-Omni despite its smaller base LLM, because it generates far more image tokens when encoding an image of this size, which slows actual inference (see the rough latency estimate below).
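The throughput numbers above can be turned into a rough end-to-end latency estimate. The sketch below assumes prefill processes the 128 text tokens plus each model's image tokens and decoding produces 128 tokens, ignoring scheduling and batching overheads:

```python
# Rough end-to-end latency estimate from the table above (not a benchmark).
# Assumes prefill covers 128 text tokens plus the model's image tokens,
# and decoding produces 128 output tokens; batching overheads are ignored.
def estimated_latency_s(image_tokens: int, prefill_tps: float, decode_tps: float,
                        text_tokens: int = 128, output_tokens: int = 128) -> float:
    prefill_s = (text_tokens + image_tokens) / prefill_tps  # time to ingest the prompt
    decode_s = output_tokens / decode_tps                   # time to generate the reply
    return prefill_s + decode_s

print(f"Megrez-3B-Omni: {estimated_latency_s(448, 6312.66, 1294.9):.3f} s")   # ~0.19 s
print(f"Qwen2-VL-2B:    {estimated_latency_s(1378, 7349.39, 685.66):.3f} s")  # ~0.39 s
```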

## Quickstart

### Online Experience

HF Chat Demo (recommended)

### Local Deployment

For environment installation and vLLM inference code deployment, refer to Infini-Megrez-Omni.

Below is an example of using transformers for inference. By passing text, image, and audio through the content field, you can interact with the model across modalities.

```python
import torch
from transformers import AutoModelForCausalLM

path = "{{PATH_TO_PRETRAINED_MODEL}}"  # Change this to the path of the model.

# Load the model with bfloat16 weights and FlashAttention-2 on the GPU.
model = (
    AutoModelForCausalLM.from_pretrained(
        path,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    .eval()
    .cuda()
)

# Example 1: chat with text and image.
messages = [
    {
        "role": "user",
        "content": {
            "text": "Please describe the content of the image.",
            "image": "./data/sample_image.jpg",
        },
    },
]

# Example 2: chat with audio and image.
# Note: this overwrites Example 1 above; use one of the two at a time.
messages = [
    {
        "role": "user",
        "content": {
            "image": "./data/sample_image.jpg",
            "audio": "./data/sample_audio.m4a",
        },
    },
]

MAX_NEW_TOKENS = 100
response = model.chat(
    messages,
    sampling=False,
    max_new_tokens=MAX_NEW_TOKENS,
    temperature=0,
)
print(response)
```
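For multi-turn conversations, a plausible pattern is to append the previous reply and the next user turn to `messages` before calling `chat` again. This is a sketch under the assumption that `model.chat` returns a plain string (as suggested by the `print(response)` above) and that assistant turns reuse the same `content` dictionary format; check the model's remote code for the exact schema.

```python
# Hypothetical multi-turn follow-up; assumes `model.chat` returns a string and
# that assistant turns reuse the same content-dict format as user turns.
messages.append({"role": "assistant", "content": {"text": response}})
messages.append({"role": "user", "content": {"text": "What objects are in the foreground?"}})

follow_up = model.chat(
    messages,
    sampling=False,
    max_new_tokens=MAX_NEW_TOKENS,
    temperature=0,
)
print(follow_up)
```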

### Notes

1. We recommend putting images in the first round of the conversation for better inference results; audio and text have no such restriction and can be switched freely.
2. In the Automatic Speech Recognition (ASR) scenario, simply change `content['text']` to "Convert speech to text." (a sketch follows this list).
3. In OCR scenarios, enabling sampling may introduce language-model hallucinations that alter the recognized text, so consider disabling sampling during inference (`sampling=False`). Note, however, that disabling sampling may introduce repetition.
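Following note 2, a minimal ASR-style request could look like the sketch below; the audio path is a placeholder reused from the examples above, and the prompt string comes straight from the note:

```python
# ASR scenario from note 2: ask the model to transcribe an audio file.
messages = [
    {
        "role": "user",
        "content": {
            "text": "Convert speech to text.",  # prompt from note 2
            "audio": "./data/sample_audio.m4a",  # placeholder path
        },
    },
]

transcript = model.chat(
    messages,
    sampling=False,  # sampling disabled, per the guidance above
    max_new_tokens=MAX_NEW_TOKENS,
    temperature=0,
)
print(transcript)
```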

## Open Source License and Usage Statement

- License: The code in this repository is open-sourced under the Apache-2.0 license.
- Hallucination: Large models inherently suffer from hallucination issues; users should not fully trust the content generated by the model.
- Values and Safety: While we have made every effort to ensure the compliance of the data used during training, its large volume and complexity mean unforeseen issues may still arise. We disclaim any liability for problems arising from the use of this open-source model, including but not limited to data security issues, public-opinion risks, or risks and problems caused by misleading outputs or by the misuse, improper dissemination, or improper utilization of the model.