Shanshan Wang committed • Commit 3ff7e45 • Parent: 8f8b827 • "updated readme"

README.md CHANGED
pipeline_tag: text-generation
---

# Model Card

[\[📄 H2OVL-Mississippi Paper\]](https://arxiv.org/abs/2410.13611)
[\[🤗 HF Demo\]](https://huggingface.co/spaces/h2oai/h2ovl-mississippi)
[\[🚀 Quick Start\]](#quick-start)

The H2OVL-Mississippi-2B is a high-performing, general-purpose vision-language model developed by H2O.ai to handle a wide range of multimodal tasks. This model, with 2 billion parameters, excels in tasks such as image captioning, visual question answering (VQA), and document understanding, while maintaining efficiency for real-world applications.

The Mississippi-2B model builds on the strong foundations of our H2O-Danube language models, now extended to integrate vision and language tasks. It competes with larger models across various benchmarks, offering a versatile and scalable solution for document AI, OCR, and multimodal reasoning.
- Optimized for Vision-Language Tasks: Achieves high performance across a wide range of applications, including document AI, OCR, and multimodal reasoning.
- Comprehensive Dataset: Trained on 17M image-text pairs, ensuring broad coverage and strong task generalization.

## Quick Start

We provide example code to run h2ovl-mississippi-2b using `transformers`.

### Install dependencies:

```bash
...
```

If you have Ampere GPUs, install flash-attention to speed up inference:

```bash
pip install flash_attn
```
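
Since flash-attention is optional, it can help to verify that the `flash_attn` package installed above is actually importable before relying on it; a minimal stdlib-only sketch (the variable name is illustrative):

```python
import importlib.util

# True when the optional flash_attn package is importable in this environment.
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
print(f"flash-attn available: {has_flash_attn}")
```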

### Inference with Transformers:

```python
import torch
...
print(f'User: {question}\nAssistant: {response}')
```

## Benchmarks

### Performance Comparison of Similarly Sized Models Across Multiple Benchmarks - OpenVLM Leaderboard

| **Models** | **Params (B)** | **Avg. Score** | **MMBench** | **MMStar** | **MMMU<sub>VAL</sub>** | **Math Vista** | **Hallusion** | **AI2D<sub>TEST</sub>** | **OCRBench** | **MMVet** |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-VL-2B | 2.1 | **57.2** | **72.2** | 47.5 | 42.2 | 47.8 | **42.4** | 74.7 | **797** | **51.5** |
| **H2OVL-Mississippi-2B** | 2.1 | 54.4 | 64.8 | 49.6 | 35.2 | **56.8** | 36.4 | 69.9 | 782 | 44.7 |
| InternVL2-2B | 2.1 | 53.9 | 69.6 | **49.8** | 36.3 | 46.0 | 38.0 | 74.1 | 781 | 39.7 |
| Phi-3-Vision | 4.2 | 53.6 | 65.2 | 47.7 | **46.1** | 44.6 | 39.0 | **78.4** | 637 | 44.1 |
| MiniMonkey | 2.2 | 52.7 | 68.9 | 48.1 | 35.7 | 45.3 | 30.9 | 73.7 | **794** | 39.8 |
| InternVL2-1B | 0.8 | 48.3 | 59.7 | 45.6 | 36.7 | 39.4 | 34.3 | 63.8 | 755 | 31.5 |
| MiniCPM-V-2 | 2.8 | 47.9 | 65.8 | 39.1 | 38.2 | 39.8 | 36.1 | 62.9 | 605 | 41.0 |
| PaliGemma-3B-mix-448 | 2.9 | 46.5 | 65.6 | 48.3 | 34.9 | 28.7 | 32.2 | 68.3 | 614 | 33.1 |
| **H2OVL-Mississippi-0.8B** | 0.8 | 43.5 | 47.7 | 39.1 | 34.0 | 39.0 | 29.6 | 53.6 | 751 | 30.0 |
| DeepSeek-VL-1.3B | 2.0 | 39.6 | 63.8 | 39.9 | 33.8 | 29.8 | 27.6 | 51.5 | 413 | 29.2 |

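As a sanity check, the Avg. Score column can be reproduced from the per-benchmark numbers; a minimal sketch for the H2OVL-Mississippi-2B row, assuming the OpenVLM convention of rescaling OCRBench from its 0-1000 range to 0-100 before taking an unweighted mean:

```python
# Per-benchmark scores for H2OVL-Mississippi-2B, taken from the table above.
scores = {
    "MMBench": 64.8,
    "MMStar": 49.6,
    "MMMU_VAL": 35.2,
    "MathVista": 56.8,
    "Hallusion": 36.4,
    "AI2D_TEST": 69.9,
    "OCRBench": 782 / 10,  # assumed rescaling from 0-1000 to 0-100
    "MMVet": 44.7,
}

# Unweighted mean over the eight benchmarks; the table reports 54.4.
avg = sum(scores.values()) / len(scores)
print(f"Avg. Score: {avg:.2f}")
```

The result agrees with the reported 54.4 to within rounding, which supports the assumed OCRBench rescaling.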
## Prompt Engineering for JSON Extraction