update readme w/ examples
- .gitattributes +3 -0
- README.md +66 -24
- examples/example-1.png +3 -0
- examples/example-2.png +3 -0
- examples/sft-examples.png +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
|
|
33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
36 |
+
examples/example-1.png filter=lfs diff=lfs merge=lfs -text
|
37 |
+
examples/example-2.png filter=lfs diff=lfs merge=lfs -text
|
38 |
+
examples/sft-examples.png filter=lfs diff=lfs merge=lfs -text
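The three added rules route the new example images through Git LFS. As a rough illustration, the Python sketch below checks which paths the rules visible in this hunk would mark as LFS-tracked. It approximates gitattributes pattern matching with `fnmatchcase`, which is an assumption: Git's real matching rules are richer. The helper names `lfs_patterns` and `is_lfs_tracked` are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Rules copied from the hunk above: a pattern followed by its attributes.
GITATTRIBUTES = """\
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
examples/example-1.png filter=lfs diff=lfs merge=lfs -text
examples/example-2.png filter=lfs diff=lfs merge=lfs -text
examples/sft-examples.png filter=lfs diff=lfs merge=lfs -text
"""


def lfs_patterns(text):
    """Return the patterns whose attribute list contains filter=lfs."""
    patterns = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_lfs_tracked(path, patterns):
    """Approximate check: patterns without a slash match the file name only."""
    name = PurePosixPath(path).name
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False


PATTERNS = lfs_patterns(GITATTRIBUTES)
```

With these rules, `is_lfs_tracked("examples/example-1.png", PATTERNS)` is true, while a PNG outside the three listed paths is not tracked, because the commit adds exact paths rather than a `*.png` wildcard.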
|
README.md
CHANGED
@@ -9,31 +9,73 @@ pipeline_tag: image-text-to-text
|
|
9 |
# Model description
|
10 |
We are excited to announce the continuation and rebranding of our **BLIP series** into **XGen-MM**, to be better aligned with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.
|
11 |
|
12 |
-
`
|
13 |
|
14 |
-
In the v1.
|
15 |
-
-
|
16 |
-
-
|
17 |
-
-
|
18 |
-
-
|
19 |
|
20 |
In addition to the models, we are also releasing a series of datasets for multi-modal pre-training, including:
|
21 |
-
- [MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens](https://arxiv.org/abs/2406.11271)
|
22 |
-
- BLIP3-OCR-200M: a dataset with dense OCR annotations.
|
23 |
-
- BLIP3-GROUNDING-50M: a dataset for enhancing the ability to ground semantic concepts in images.
|
24 |
- BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.
|
25 |
|
26 |
-
|
27 |
-
|
28 |
-
|
29 |
-
# Results
|
30 |
|
31 |
-
### Base model (without instruction tuning)
|
32 |
|
33 |
-
|
|
|
34 |
|
35 |
-
|
36 |
37 |
|
38 |
# How to use
|
39 |
|
@@ -53,23 +95,23 @@ We strongly recommend users assess safety and fairness before applying to downst
|
|
53 |
|
54 |
# License
|
55 |
|
56 |
-
Our code and weights are released under the
|
57 |
|
58 |
# Code acknowledgement
|
59 |
Our training code is based on [OpenFlamingo: An open-source framework for training large multimodal models.](https://github.com/mlfoundations/open_flamingo), and part of our data preprocessing code is adapted from [LLaVA](https://github.com/haotian-liu/LLaVA).
|
60 |
-
|
61 |
|
62 |
We thank the authors for their open-source implementations.
|
63 |
|
64 |
|
65 |
# Citation
|
66 |
```
|
67 |
-
@
|
68 |
-
|
69 |
-
|
70 |
-
|
71 |
-
|
72 |
-
|
73 |
}
|
74 |
```
|
75 |
|
|
|
9 |
# Model description
|
10 |
We are excited to announce the continuation and rebranding of our **BLIP series** into **XGen-MM**, to be better aligned with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.
|
11 |
|
12 |
+
`xGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. The series builds on the designs of the `BLIP` series, with fundamental enhancements intended to provide a more robust foundation. The models are trained at scale on high-quality image-caption datasets and interleaved image-text data.
|
13 |
|
14 |
+
In the v1.5 (08/2024) release, we present a series of XGen-MM models including:
|
15 |
+
- [🤗 xGen-MM-base](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1.5): `xgen-mm-phi3-mini-base-r-v1.5`
|
16 |
+
- [🤗 xGen-MM-instruct](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1.5): `xgen-mm-phi3-mini-instruct-r-v1.5`
|
17 |
+
- [🤗 xGen-MM-instruct-interleave](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-multi-r-v1.5): `xgen-mm-phi3-mini-instruct-multi-r-v1.5`
|
18 |
+
- [🤗 xGen-MM-instruct-dpo](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-dpo-r-v1.5): `xgen-mm-phi3-mini-instruct-dpo-r-v1.5`
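As a minimal sketch of how the four checkpoints listed above might be selected and loaded: the repository IDs are copied from the list, while the helper `load_xgen_mm` and the short variant keys are hypothetical. The loading calls assume the checkpoints ship custom model code that `transformers` picks up through `trust_remote_code=True`; treat the "How to use" section as the authoritative reference.

```python
# Repository IDs copied from the model list above; the short keys are
# illustrative labels, not official names.
XGEN_MM_V1_5 = {
    "base": "Salesforce/xgen-mm-phi3-mini-base-r-v1.5",
    "instruct": "Salesforce/xgen-mm-phi3-mini-instruct-r-v1.5",
    "instruct-interleave": "Salesforce/xgen-mm-phi3-mini-instruct-multi-r-v1.5",
    "instruct-dpo": "Salesforce/xgen-mm-phi3-mini-instruct-dpo-r-v1.5",
}


def load_xgen_mm(variant="instruct"):
    """Load model, tokenizer, and image processor for one variant.

    Assumption: the checkpoint exposes custom code via trust_remote_code.
    """
    # Imported lazily so the mapping is usable without transformers installed.
    from transformers import (
        AutoImageProcessor,
        AutoModelForVision2Seq,
        AutoTokenizer,
    )

    repo_id = XGEN_MM_V1_5[variant]
    model = AutoModelForVision2Seq.from_pretrained(repo_id, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    image_processor = AutoImageProcessor.from_pretrained(
        repo_id, trust_remote_code=True
    )
    return model, tokenizer, image_processor
```

Calling `load_xgen_mm("instruct-interleave")` would download the multi-image variant. Because `trust_remote_code=True` executes code from the repository, review that code before running it.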
|
19 |
|
20 |
In addition to the models, we are also releasing a series of datasets for multi-modal pre-training, including:
|
21 |
+
- [🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens](https://arxiv.org/abs/2406.11271)
|
22 |
+
- [🤗 BLIP3-OCR-200M](https://huggingface.co/datasets/Salesforce/blip3-ocr-200m): a dataset with dense OCR annotations.
|
23 |
+
- [🤗 BLIP3-GROUNDING-50M](https://huggingface.co/datasets/Salesforce/blip3-grounding-50m): a dataset for enhancing the ability to ground semantic concepts in images.
|
24 |
- BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.
|
25 |
|
26 |
+
For more details, check out our [tech report]() and project page (coming soon).
|
|
|
|
|
|
|
27 |
|
|
|
28 |
|
29 |
+
# Data
|
30 |
+
The instruct model is fine-tuned on a mixture of around 1 million samples from multiple domains. All the fine-tuning data are from public sources, most of which are covered in [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron).
|
31 |
|
32 |
+
# Results
|
33 |
|
34 |
+
### Single-image benchmarks
|
35 |
+
|
36 |
+
| Model (Size) | SEED-IMG | SEED v2 | MMB (dev) | MMStar | MME (norm) | CVB-2D | CVB-3D | RealWorldQA | MMMU (val) | MathVista | ScienceQA | POPE | TextVQA | Avg. all | Avg. perc. |
|
37 |
+
|--------------------------------|:---------:|:-------:|:----------:|:-------:|:-----------:|:-------:|:-----------------:|-------------------|:-----------------:|:-----------------:|:-----------------:|:-----------------:|----------------|:--------------:|----------------|
|
38 |
+
| Closed-source models | | | | | | | | | | | | | | | |
|
39 |
+
| GPT-4V<sup>*</sup> | 72.0 | - | 80.8 | 49.7 | 63.3 | 64.3 | 73.8 | 56.5 | 53.8 | 48.2 | 82.1 | 75.4 | - | - | - |
|
40 |
+
| MM1-3B-Chat (3B) | 68.8 | - | 67.8 | - | 62.9 | - | - | - | 33.9 | - | - | 87.4 | - | - | - |
|
41 |
+
| Open-source models | | | | | | | | | | | | | | | |
|
42 |
+
| HPT-1.5-edge (4B) | **72.3** | - | 74.6 | 45.8 | - | - | - | - | 42.6 | **45.1** | 85.4 | **91.0** | - | - | - |
|
43 |
+
| VILA-1.5-3B (3B) | 67.9 | - | 63.4 | - | - | - | - | - | 33.3 | - | 69.0 | 85.9 | - | - | - |
|
44 |
+
| VILA-1.5-3B<sup>**</sup> (3B) | 67.9 | 51.9 | 62.4 | 40.3 | 58.5 | 50.1 | 60.3 | 53.3 | 34.1 | 30.6 | 68.9 | 86.9 | 58.1 | 55.6 | 59.1 |
|
45 |
+
| phi-3-vision (4B) | - | - | 80.5 | - | - | - | - | - | - | 44.5 | 90.8 | 85.8 | 70.9 | - | - |
|
46 |
+
| phi-3-vision<sup>**</sup> (4B) | 71.0 | 52.7 | 74.2 | <u>47.9</u> | 55.3 | 60.7 | 68.2 | 59.1 | **46.1** | **45.1** | **90.2** | 83.5 | **73.3** | 63.6 | 63.6 |
|
47 |
+
| xGen-MM-inst. (4B) | 71.8 | <u>53.9</u> | <u>76</u> | 46.7 | <u>63.8</u> | <u>66.2</u> | **75.4** | **61.6** | <u>42.8</u> | 39.2 | 85.6 | 87.0 | <u>72.0</u> | <u>64.8</u> | <u>66.9</u> |
|
48 |
+
| **<u>xGen-MM-inst.-interleave (4B)</u>** | <u>72.2</u> | **55.5** | **76.8** | **48.1** | **64.4** | **69.3** | <u>72.3</u> | <u>60.5</u> | 41.1 | <u>39.6</u> | <u>88.3</u> | 87.0 | 71.0 | **65.1** | **67.3** |
|
49 |
+
|
50 |
+
* GPT-4V (gpt-4-1106-preview) results are taken from this third-party [leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard).
|
51 |
+
** Model results are tested with our evaluation code for a fair comparison.
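The `Avg. all` column appears to be the unweighted mean of the 13 per-benchmark scores in each fully evaluated row. That reading is inferred from the numbers, not stated in the README. The sketch below recomputes it from scores copied out of the table. The benchmark subset behind `Avg. perc.` is not specified here, so that column is not reproduced.

```python
from statistics import mean

# Scores copied from the table rows above, in column order
# (SEED-IMG through TextVQA); only fully evaluated rows are included.
SCORES = {
    "VILA-1.5-3B**": [67.9, 51.9, 62.4, 40.3, 58.5, 50.1, 60.3,
                      53.3, 34.1, 30.6, 68.9, 86.9, 58.1],
    "phi-3-vision**": [71.0, 52.7, 74.2, 47.9, 55.3, 60.7, 68.2,
                       59.1, 46.1, 45.1, 90.2, 83.5, 73.3],
    "xGen-MM-inst.": [71.8, 53.9, 76.0, 46.7, 63.8, 66.2, 75.4,
                      61.6, 42.8, 39.2, 85.6, 87.0, 72.0],
    "xGen-MM-inst.-interleave": [72.2, 55.5, 76.8, 48.1, 64.4, 69.3, 72.3,
                                 60.5, 41.1, 39.6, 88.3, 87.0, 71.0],
}

# "Avg. all" values as reported in the table.
REPORTED_AVG_ALL = {
    "VILA-1.5-3B**": 55.6,
    "phi-3-vision**": 63.6,
    "xGen-MM-inst.": 64.8,
    "xGen-MM-inst.-interleave": 65.1,
}


def avg_all(scores):
    """Unweighted mean over all benchmarks, rounded to one decimal place."""
    return round(mean(scores), 1)


RECOMPUTED = {name: avg_all(values) for name, values in SCORES.items()}
```

For these four rows the recomputed values agree with the reported `Avg. all` column, which supports the unweighted-mean reading.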
|
52 |
+
|
53 |
+
### Multi-image benchmarks
|
54 |
+
|
55 |
+
| Model | BLINK | QBench-2 | Mantis-eval |
|
56 |
+
|-------------------------------|:-----------------:|:-----------------:|:-----------------:|
|
57 |
+
| GPT-4V <sup>†</sup> | 51.1 | 73.4 | 62.7 |
|
58 |
+
| VILA-1.5-3B<sup>††</sup> (3B) | 39.8 | 51.7 | 41.9 |
|
59 |
+
| xGen-MM-inst. (4B) | 46.6 | 52.4 | 42.4 |
|
60 |
+
| **<u>xGen-MM-inst.-interleave (4B)</u>** | 49.7 | 75.1 | 56.7 |
|
61 |
+
† GPT-4V results are the numbers reported in each benchmark's original paper.
|
62 |
+
†† Model results are tested with our evaluation code for a fair comparison.
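To make the comparison in the multi-image table concrete, the short sketch below computes the per-benchmark difference between the interleave model and the single-image instruct model. The scores are copied from the table. The helper name `gains` is hypothetical.

```python
# Scores copied from the multi-image table above.
MULTI_IMAGE = {
    "xGen-MM-inst.": {"BLINK": 46.6, "QBench-2": 52.4, "Mantis-eval": 42.4},
    "xGen-MM-inst.-interleave": {"BLINK": 49.7, "QBench-2": 75.1, "Mantis-eval": 56.7},
}


def gains(base, other):
    """Point difference (other minus base) per benchmark, one decimal place."""
    return {name: round(other[name] - base[name], 1) for name in base}


INTERLEAVE_GAINS = gains(
    MULTI_IMAGE["xGen-MM-inst."],
    MULTI_IMAGE["xGen-MM-inst.-interleave"],
)
```

The differences come out to 3.1 points on BLINK, 22.7 on QBench-2, and 14.3 on Mantis-eval.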
|
63 |
+
|
64 |
+
|
65 |
+
### Examples
|
66 |
+
|
67 |
+
<p>
|
68 |
+
<figure class="half">
|
69 |
+
<a href="examples/example-1.png"><img src="./examples/example-1.png"></a>
|
70 |
+
<a href="examples/example-2.png"><img src="./examples/example-2.png"></a>
|
71 |
+
</figure>
|
72 |
+
</p>
|
73 |
+
|
74 |
+
<p>
|
75 |
+
<figure>
|
76 |
+
<a href="examples/sft-examples.png"><img src="./examples/sft-examples.png"></a>
|
77 |
+
</figure>
|
78 |
+
</p>
|
79 |
|
80 |
# How to use
|
81 |
|
|
|
95 |
|
96 |
# License
|
97 |
|
98 |
+
Our code and weights are released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) license.
|
99 |
|
100 |
# Code acknowledgement
|
101 |
Our training code is based on [OpenFlamingo: An open-source framework for training large multimodal models.](https://github.com/mlfoundations/open_flamingo), and part of our data preprocessing code is adapted from [LLaVA](https://github.com/haotian-liu/LLaVA).
|
102 |
+
The evaluation code for the instruct models is based on [VLMEvalKit: Open-source evaluation toolkit of large vision-language models (LVLMs)](https://github.com/open-compass/VLMEvalKit).
|
103 |
|
104 |
We thank the authors for their open-source implementations.
|
105 |
|
106 |
|
107 |
# Citation
|
108 |
```
|
109 |
+
@article{blip3-xgenmm,
|
110 |
+
author = {Le Xue and Manli Shu and Anas Awadalla and Jun Wang and An Yan and Senthil Purushwalkam and Honglu Zhou and Viraj Prabhu and Yutong Dai and Michael S Ryoo and Shrikant Kendre and Jieyu Zhang and Can Qin and Shu Zhang and Chia-Chih Chen and Ning Yu and Juntao Tan and Tulika Manoj Awalgaonkar and Shelby Heinecke and Huan Wang and Yejin Choi and Ludwig Schmidt and Zeyuan Chen and Silvio Savarese and Juan Carlos Niebles and Caiming Xiong and Ran Xu},
|
111 |
+
title = {xGen-MM(BLIP-3): A Family of Open Large Multimodal Models},
|
112 |
+
journal = {arXiv preprint},
|
113 |
+
month = {August},
|
114 |
+
year = {2024},
|
115 |
}
|
116 |
```
|
117 |
|
examples/example-1.png
ADDED
|
examples/example-2.png
ADDED
|
examples/sft-examples.png
ADDED
|