update readme w/ examples

Files changed (5)

.gitattributes +3 -0
README.md +66 -24
examples/example-1.png +3 -0
examples/example-2.png +3 -0
examples/sft-examples.png +3 -0

.gitattributes CHANGED

@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+examples/example-1.png filter=lfs diff=lfs merge=lfs -text
+examples/example-2.png filter=lfs diff=lfs merge=lfs -text
+examples/sft-examples.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -9,31 +9,73 @@ pipeline_tag: image-text-to-text
 # Model description
 We are excited to announce the continuation and rebranding of our **BLIP series** into **XGen-MM**, to be better aligned with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.
-`XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.
-In the v1.1 (08/2024) release, we present a series of XGen-MM models including:
-- Base model `xgen-mm-phi3-mini-base-r-v1.5`
-- Single-image instruct model `xgen-mm-phi3-mini-instruct-r-v1.5`
-- Multi-image instruct model `xgen-mm-phi3-mini-instruct-multi-r-v1.5`
-- DPO instruct model `xgen-mm-phi3-mini-instruct-dpo-r-v1.5`
 In addition to the models, we are also releasing a series of datasets for multi-modal pre-training, including:
-- [MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens](https://arxiv.org/abs/2406.11271)
-- BLIP3-OCR-200M: a dataset with dense OCR annotations.
-- BLIP3-GROUNDING-50M: a dataset for enhancing the ability to ground semantic concepts in images.
 - BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.
-# Data
-# Results
-### Base model (without instruction tuning)
-### Instruct model
-### DPO model
 # How to use
@@ -53,23 +95,23 @@ We strongly recommend users assess safety and fairness before applying to downst
 # License
-Our code and weights are released under the Creative Commons Attribution Non Commercial 4.0 [LICENSE](LICENSE.txt). Please fill out a form at [here](https://forms.gle/ffPc9oZC2ZGeJ1N68) to consult the commercial use of model weights.
 # Code acknowledgement
 Our training code is based on [OpenFlamingo: An open-source framework for training large multimodal models.](https://github.com/mlfoundations/open_flamingo), and part of our data preprocessing code is adapted from [LLaVA](https://github.com/haotian-liu/LLaVA).
-Our evaluation code is based on [VLMEvalKit: Open-source evaluation toolkit of large vision-language models (LVLMs)](https://github.com/open-compass/VLMEvalKit).
 We thank the authors for their open-source implementations.
 # Citation
 ```
-@misc{xgen_mm_phi3_mini,
-    title={xgen-mm-phi3-mini-instruct Model Card},
-    url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
-    author={Salesforce AI Research},
-    month={May},
-    year={2024}
 }
 ```

 # Model description
 We are excited to announce the continuation and rebranding of our **BLIP series** into **XGen-MM**, to be better aligned with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.
+`xGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.
+In the v1.5 (08/2024) release, we present a series of xGen-MM models including:
+- [🤗 xGen-MM-base](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1.5): `xgen-mm-phi3-mini-base-r-v1.5`
+- [🤗 xGen-MM-instruct](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1.5): `xgen-mm-phi3-mini-instruct-r-v1.5`
+- [🤗 xGen-MM-instruct-interleave](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-multi-r-v1.5): `xgen-mm-phi3-mini-instruct-multi-r-v1.5`
+- [🤗 xGen-MM-instruct-dpo](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-dpo-r-v1.5): `xgen-mm-phi3-mini-instruct-dpo-r-v1.5`
 In addition to the models, we are also releasing a series of datasets for multi-modal pre-training, including:
+- [🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens](https://arxiv.org/abs/2406.11271)
+- [🤗 BLIP3-OCR-200M](https://huggingface.co/datasets/Salesforce/blip3-ocr-200m): a dataset with dense OCR annotations.
+- [🤗 BLIP3-GROUNDING-50M](https://huggingface.co/datasets/Salesforce/blip3-grounding-50m): a dataset for enhancing the ability to ground semantic concepts in images.
 - BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.
+For more details, check out our [tech report]() and project page (coming soon).
+# Data
+The instruct model is fine-tuned on a mixture of around 1 million samples from multiple domains. All the fine-tuning data are from public sources, most of which are covered in [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron).
+# Results
+### Single-image benchmarks
+| Model (Size)                   | SEED -IMG | SEED v2 | MMB  (dev) | MM Star | MME  (norm) | CVB -2D |      CVB -3D      | RealW QA          |     MMMU (val)    |     Math Vista    |       Sci QA      |        POPE       | Text VQA       |    Avg. all    | Avg. perc.     |
+|--------------------------------|:---------:|:-------:|:----------:|:-------:|:-----------:|:-------:|:-----------------:|-------------------|:-----------------:|:-----------------:|:-----------------:|:-----------------:|----------------|:--------------:|----------------|
+| Closed-source models           |           |         |            |         |             |         |                   |                   |                   |                   |                   |                   |                |                |                |
+| GPT-4V<sup>&ast;</sup>                         |    72.0   |     -   |     80.8   |   49.7  |     63.3    |   64.3  |  73.8 |  56.5 |  53.8 |  48.2 |  82.1 |  75.4 |  - |  - |  - |
+| MM1-3B-Chat (3B)               |    68.8   |    -    |    67.8    |    -    |     62.9    |    -    |         -         |         -         |        33.9       |         -         |         -         |        87.4       |        -       |        -       |        -       |
+| Open-source models             |           |         |            |         |             |         |                   |                   |                   |                   |                   |                   |                |                |                |
+| HPT-1.5-edge (4B)              |    **72.3**  |    -    |    74.6    |   45.8  |      -      |    -    |         -         |         -         |        42.6       |        **45.1**       |        85.4       |        **91.0**       |        -       |        -       |        -       |
+| VILA-1.5-3B (3B)               |    67.9   |    -    |    63.4    |    -    |      -      |    -    |         -         |         -         |        33.3       |         -         |        69.0       |        85.9       |        -       |        -       |        -       |
+| VILA-1.5-3B<sup>&ast;&ast;</sup> (3B)      |    67.9   |   51.9  |    62.4    |   40.3  |     58.5    |   50.1  |        60.3       |        53.3       |        34.1       |        30.6       |        68.9       |        86.9       |      58.1      |      55.6      |      59.1      |
+| phi-3-vision (4B)              |     -     |    -    |    80.5    |    -    |      -      |    -    |         -         |         -         |         -         |        44.5       |        90.8       |        85.8       |      70.9      |        -       |        -       |
+| phi-3-vision<sup>&ast;&ast;</sup> (4B)     |    71.0   |   52.7  |    74.2    |   <u>47.9</u>  |     55.3    |   60.7  |        68.2       |        59.1       |        **46.1**      |        **45.1**       |        **90.2**       |        83.5       |      **73.3**      |      63.6      |      63.6      |
+| xGen-MM-inst. (4B)             |    71.8   |   <u>53.9</u>  |     <u>76</u>     |   46.7  |     <u>63.8</u>    |   <u>66.2</u>  |        **75.4**       |        **61.6**       |        <u>42.8</u>       |        39.2       |        85.6       |        87.0       |      <u>72.0</u>      |      <u>64.8</u>      |      <u>66.9</u>      |
+| **<u>xGen-MM-inst.-interleave (4B)</u>** |    <u>72.2</u>   |   **55.5**  |    **76.8**    |   **48.1**  |     **64.4**    |   **69.3**  |        <u>72.3</u>       |        <u>60.5</u>       |        41.1       |        <u>39.6</u>       |        <u>88.3</u>       |        87.0       |      71.0      |      **65.1**      |      **67.3**      |
+&ast; GPT-4V (gpt-4-1106-preview) results are taken from this third-party [leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard).
+&ast;&ast; Model results are tested with our evaluation code for a fair comparison.
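The `Avg. all` column appears to be the unweighted mean of the 13 benchmark columns to its left. This is an inference from the numbers, not something the model card states. The following is a minimal Python sketch that recomputes it for three fully populated rows. The `average_all` helper is hypothetical, and `Avg. perc.` is left out because the table does not say which benchmarks it covers.

```python
# Scores copied from the single-image table, in column order:
# SEED-IMG, SEED v2, MMB (dev), MM Star, MME (norm), CVB-2D, CVB-3D,
# RealW QA, MMMU (val), Math Vista, Sci QA, POPE, Text VQA.
scores = {
    "phi-3-vision** (4B)": [71.0, 52.7, 74.2, 47.9, 55.3, 60.7, 68.2,
                            59.1, 46.1, 45.1, 90.2, 83.5, 73.3],
    "xGen-MM-inst. (4B)": [71.8, 53.9, 76.0, 46.7, 63.8, 66.2, 75.4,
                           61.6, 42.8, 39.2, 85.6, 87.0, 72.0],
    "xGen-MM-inst.-interleave (4B)": [72.2, 55.5, 76.8, 48.1, 64.4, 69.3, 72.3,
                                      60.5, 41.1, 39.6, 88.3, 87.0, 71.0],
}


def average_all(row):
    """Unweighted mean over all benchmark scores, rounded to one decimal.

    Assumption: this is how 'Avg. all' was derived; the source does not say.
    """
    return round(sum(row) / len(row), 1)


for model, row in scores.items():
    print(f"{model}: {average_all(row)}")
# phi-3-vision** (4B): 63.6
# xGen-MM-inst. (4B): 64.8
# xGen-MM-inst.-interleave (4B): 65.1
```

The three recomputed values match the `Avg. all` entries reported in the table for those rows.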
+### Multi-image benchmarks
+| Model                         |       BLINK       |      QBench-2     |    Mantis-eval    |
+|-------------------------------|:-----------------:|:-----------------:|:-----------------:|
+| GPT-4V <sup>&dagger;</sup>       |              51.1 |              73.4 |              62.7 |
+| VILA-1.5-3B<sup>&dagger;&dagger;</sup> (3B)     |        39.8       |        51.7       |        41.9       |
+| xGen-MM-inst. (4B)            |        46.6       |        52.4       |        42.4       |
+| **<u>xGen-MM-inst.-interleave (4B)</u>** |        49.7       |        75.1       |        56.7       |
+&dagger; GPT-4V results are the numbers reported in each benchmark's original paper.
+&dagger;&dagger; Model results are tested with our evaluation code for a fair comparison.
+### Examples
+<p>
+<figure class="half">
+    <a href="examples/example-1.png"><img src="./examples/example-1.png"></a>
+    <a href="examples/example-2.png"><img src="./examples/example-2.png"></a>
+</figure>
+</p>
+<p>
+<figure>
+    <a href="examples/sft-examples.png"><img src="./examples/sft-examples.png"></a>
+</figure>
+</p>
 # How to use
 # License
+Our code and weights are released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) license.
 # Code acknowledgement
 Our training code is based on [OpenFlamingo: An open-source framework for training large multimodal models.](https://github.com/mlfoundations/open_flamingo), and part of our data preprocessing code is adapted from [LLaVA](https://github.com/haotian-liu/LLaVA).
+The evaluation code for the instruct models is based on [VLMEvalKit: Open-source evaluation toolkit of large vision-language models (LVLMs)](https://github.com/open-compass/VLMEvalKit).
 We thank the authors for their open-source implementations.
 # Citation
 ```
+@article{blip3-xgenmm,
+  author    = {Le Xue and Manli Shu and Anas Awadalla and Jun Wang and An Yan and Senthil Purushwalkam and Honglu Zhou and Viraj Prabhu and Yutong Dai and Michael S Ryoo and Shrikant Kendre and Jieyu Zhang and Can Qin and Shu Zhang and Chia-Chih Chen and Ning Yu and Juntao Tan and Tulika Manoj Awalgaonkar and Shelby Heinecke and Huan Wang and Yejin Choi and Ludwig Schmidt and Zeyuan Chen and Silvio Savarese and Juan Carlos Niebles and Caiming Xiong and Ran Xu},
+  title     = {xGen-MM(BLIP-3): A Family of Open Large Multimodal Models},
+  journal   = {arXiv preprint},
+  month     = {August},
+  year      = {2024},
 }
 ```

examples/example-1.png ADDED

Git LFS Details

SHA256: 78373da19f77ccd7174148f988a18f8698be43f9f4b0a2bb5aa8810b51e16539
Pointer size: 132 Bytes
Size of remote file: 2.05 MB
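
The 132-byte pointer size is consistent with the Git LFS v1 pointer format: a version line, an `oid sha256:` line, and a `size` line. Below is a minimal Python sketch of that layout. The `lfs_pointer` helper is hypothetical, and the byte count `2_050_000` is an assumed placeholder, since the page only reports the size as 2.05 MB. Any seven-digit size gives the same pointer length.

```python
def lfs_pointer(sha256_hex: str, size_bytes: int) -> str:
    """Build the text of a Git LFS v1 pointer file (illustrative helper)."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size_bytes}\n"
    )


# SHA256 copied from the LFS details above.
oid = "78373da19f77ccd7174148f988a18f8698be43f9f4b0a2bb5aa8810b51e16539"

# Assumed exact size in bytes; the page only shows the rounded "2.05 MB".
pointer = lfs_pointer(oid, 2_050_000)

print(len(pointer.encode("utf-8")))  # 132
```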

examples/example-2.png ADDED

Git LFS Details

SHA256: 39420b5e4bd2d59eacd2be21fb64987906f5cf6122332c09de55fa2a7cd1ea58
Pointer size: 132 Bytes
Size of remote file: 2.87 MB

examples/sft-examples.png ADDED

Git LFS Details

SHA256: edaff18706e1ff10aabc3f9c0cb71c3213615ef8d0a186c0bdf2b069c2e477a4
Pointer size: 132 Bytes
Size of remote file: 1.07 MB