Files changed (1)
  1. README.md +38 -14
README.md CHANGED
@@ -11,28 +11,52 @@ We are excited to announce the continuation and rebranding of our **BLIP series*
11
 
12
  `XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.
13
 
14
- In the v1.1 (08/2024) release, we present a series of XGen-MM models including:
15
- - Base model `xgen-mm-phi3-mini-base-r-v1.5`
16
- - Single-image instruct model `xgen-mm-phi3-mini-instruct-r-v1.5`
17
- - Multi-image instruct model `xgen-mm-phi3-mini-instruct-multi-r-v1.5`
18
- - DPO instruct model `xgen-mm-phi3-mini-instruct-dpo-r-v1.5`
19
 
20
  In addition to the models, we are also releasing a series of datasets for multi-modal pre-training, including:
21
- - [MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens](https://arxiv.org/abs/2406.11271)
22
- - BLIP3-OCR-200M: a dataset with dense OCR annotations.
23
- - BLIP3-GROUNDING-50M: a dataset for enhancing the ability to ground semantic concepts in images.
24
  - BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.
25
 
 
 
26
  # Data
 
27
 
28
 
29
  # Results
30
 
31
- ### Base model (without instruction tuning)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32
 
33
- ### Instruct model
34
 
35
- ### DPO model
 
 
 
 
36
 
37
 
38
  # How to use
@@ -41,8 +65,8 @@ Please check out our [inference notebook](demo.ipynb) for example code to use ou
41
 
42
  # Reproducibility:
43
 
44
- Our evaluation is implemented based on [open-compass/VLMEvalKit](https://github.com/open-compass/VLMEvalKit). We will create a PR to that repo to support XGen-MM evaluation.
45
-
46
 
47
  # Bias, Risks, Limitations, and Ethical Considerations
48
  The main data sources are from the internet, including webpages,
@@ -65,7 +89,7 @@ We thank the authors for their open-source implementations.
65
  # Citation
66
  ```
67
  @misc{xgen_mm_phi3_mini,
68
- title={xgen-mm-phi3-mini-instruct Model Card},
69
  url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
70
  author={Salesforce AI Research},
71
  month={May},
 
11
 
12
  `XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.
13
 
14
+ In the v1.5 (08/2024) release, we present a series of XGen-MM models including:
15
+ - [🤗 xGen-MM-base](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1.5): `xgen-mm-phi3-mini-base-r-v1.5`
16
+ - [🤗 xGen-MM-instruct](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1.5): `xgen-mm-phi3-mini-instruct-r-v1.5`
17
+ - [🤗 xGen-MM-instruct-interleave](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-multi-r-v1.5): `xgen-mm-phi3-mini-instruct-multi-r-v1.5`
18
+ - [🤗 xGen-MM-instruct-dpo](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-dpo-r-v1.5): `xgen-mm-phi3-mini-instruct-dpo-r-v1.5`
19
 
20
  In addition to the models, we are also releasing a series of datasets for multi-modal pre-training, including:
21
+ - [🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens](https://arxiv.org/abs/2406.11271)
22
+ - [🤗 BLIP3-OCR-200M](https://huggingface.co/datasets/Salesforce/blip3-ocr-200m): a dataset with dense OCR annotations.
23
+ - [🤗 BLIP3-GROUNDING-50M](https://huggingface.co/datasets/Salesforce/blip3-grounding-50m): a dataset for enhancing the ability to ground semantic concepts in images.
24
  - BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.
25
 
26
+ For more details, check out our [tech report]() and project page (coming soon).
27
+
28
  # Data
29
+ The base model is pre-trained on a mixture of data sources described above, with around 100 billion image-text tokens in total.
30
 
31
 
32
  # Results
33
 
34
+ ### Few-shot Evaluation on Base model (without instruction tuning)
35
+
36
+ | Model | Shot | VQAv2 | TextVQA | OKVQA | COCO | NoCaps | TextCaps |
37
+ |:--------------|:-----|:------|:--------|:------|:------|:-------|:---------|
38
+ | Flamingo-3B | 0 | 49.2 | 30.1 | 41.2 | 73.0 | - | - |
39
+ | | 4 | 53.2 | 32.7 | 43.3 | 85.0 | - | - |
40
+ | | 8 | 55.4 | 32.4 | 44.6 | 90.6 | - | - |
41
+ | MM1-3B | 0 | 46.2 | 29.4 | 26.1 | 73.5 | 55.6 | 63.3 |
42
+ | | 4 | 57.9 | 45.3 | 44.6 | **112.3** | 99.7 | 84.1 |
43
+ | | 8 | 63.6 | 44.6 | 48.4 | **114.6** | **104.7** | 88.8 |
44
+ | xGen-MM-base | 0 | 43.1 | 34.0 | 28.0 | 67.2 | 82.6 | 69.5 |
45
+ | | 4 | **66.3**| **54.2**| **48.9**| 107.6 | **100.8**| **89.9** |
46
+ | | 8 | **66.9**| **55.3**| **50.1**| 109.8| 104.6| **94.0**|
47
+
48
+
49
+ ### Showcases on In-Context Learning
50
+
51
+ Below are some qualitative examples of the multi-modal in-context learning capability of our base model.
52
 
53
+ <img src="icl_examples/art.png" alt="Art" width=500>
54
 
55
+
56
+ <img src="icl_examples/animal.png" alt="Animal" width=500>
57
+
58
+
59
+ <img src="icl_examples/street.png" alt="Street" width=500>
60
 
61
 
62
  # How to use
 
65
 
66
  # Reproducibility:
67
 
68
+ The pretraining evaluation is implemented on top of [OpenFlamingo](https://github.com/mlfoundations/open_flamingo), an open-source framework for training large multimodal models.
69
+ Few-shot examples are drawn at random, so results vary somewhat across random seeds.
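
The seed dependence of few-shot evaluation can be illustrated with a minimal sketch. This is not the OpenFlamingo evaluation code: the helper `draw_few_shot` and the `support_pool` of image/caption pairs are hypothetical names introduced only for illustration. The sketch shows that fixing the seed makes the drawn demonstrations reproducible, while a different seed may select a different set and therefore shift the reported score.

```python
import random


def draw_few_shot(pool, num_shots, seed):
    """Sample `num_shots` demonstrations from `pool` without replacement.

    Hypothetical helper for illustration; not part of OpenFlamingo.
    """
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(pool, num_shots)


# Hypothetical support set of (image_path, caption) pairs.
support_pool = [(f"image_{i:03d}.jpg", f"caption {i}") for i in range(100)]

# The same seed always yields the same demonstrations.
same_a = draw_few_shot(support_pool, num_shots=4, seed=0)
same_b = draw_few_shot(support_pool, num_shots=4, seed=0)

# A different seed may pick different demonstrations.
other = draw_few_shot(support_pool, num_shots=4, seed=1)

print(same_a == same_b)
print(other)
```

When comparing against the table above, it is reasonable to run several seeds and report the mean, since a single run reflects only one random draw of demonstrations.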
70
 
71
  # Bias, Risks, Limitations, and Ethical Considerations
72
  The main data sources are from the internet, including webpages,
 
89
  # Citation
90
  ```
91
  @misc{xgen_mm_phi3_mini,
92
+ title={xgen-mm-phi3-mini-base Model Card},
93
  url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
94
  author={Salesforce AI Research},
95
  month={May},