shaoyent committed
Commit 855785f
1 Parent(s): ba03f0c

Update README.md

Files changed (1)
  1. README.md +12 -15
README.md CHANGED
@@ -82,19 +82,20 @@ print(results)
 ## Training data
 
-The BridgeTower model was pretrained on four public image-caption datasets:
-- [Conceptual Captions(CC)](https://ai.google.com/research/ConceptualCaptions/),
-- [SBU Captions](https://www.cs.rice.edu/~vo9/sbucaptions/),
-- [MSCOCO Captions](https://arxiv.org/pdf/1504.00325.pdf),
+The BridgeTower model was pretrained on five public image-caption datasets:
+- [Conceptual Captions (CC3M)](https://ai.google.com/research/ConceptualCaptions/)
+- [Conceptual 12M (CC12M)](https://github.com/google-research-datasets/conceptual-12m)
+- [SBU Captions](https://www.cs.rice.edu/~vo9/sbucaptions/)
+- [MSCOCO Captions](https://arxiv.org/pdf/1504.00325.pdf)
 - [Visual Genome](https://visualgenome.org/)
 
-The total number of unique images in the combined data is 4M.
+The total number of unique images in the combined data is around 16M.
 
 ## Training procedure
 
 ### Pretraining
 
-The model was pre-trained for ___ steps on an "Intel AI supercomputing cluster" using 512 Gaudis and 128 Xeons with a batch size of 4096.
-The optimizer used was AdamW with a learning rate of 1e-5. No data augmentation was used except for center-crop. The image resolution in pre-training is set to 288 x 288.
+The model was pre-trained for 10 epochs on an Intel AI supercomputing cluster using 512 Gaudis and 128 Xeons with a batch size of 2048.
+The optimizer used was AdamW with a learning rate of 1e-7. No data augmentation was used except for center-crop. The image resolution in pre-training is set to 294 x 294.
 
 ## Evaluation results
 Please refer to [Table 5](https://arxiv.org/pdf/2206.08657.pdf) for BridgeTower's performance on Image Retrieval and other downstream tasks.
@@ -102,13 +103,9 @@ Please refer to [Table 5](https://arxiv.org/pdf/2206.08657.pdf) for BridgeTower'
 ### BibTeX entry and citation info
 ```bibtex
 @article{xu2022bridge,
-  title={Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning},
-  author={Xu, Xiao and
-          Wu, Chenfei and
-          Rosenman, Shachar and
-          Lal, Vasudev and
-          Duan, Nan},
-  journal={arXiv preprint arXiv:2206.08657},
-  year={2022}
+  title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
+  author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
+  journal={arXiv preprint arXiv:2206.08657},
+  year={2022}
 }
 ```
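To make the updated pretraining recipe concrete, here is a minimal PyTorch sketch of the settings the card states (AdamW at 1e-7, no augmentation except center-crop, 294 x 294 inputs). The `Resize` step and the stand-in `model` are illustrative assumptions; the card does not publish the actual training script.

```python
import torch
from torchvision import transforms

# Preprocessing as stated in the card: no augmentation except center-crop,
# at the pre-training resolution of 294 x 294.
# The Resize before the crop is an assumption, not stated in the card.
preprocess = transforms.Compose([
    transforms.Resize(294),
    transforms.CenterCrop(294),
    transforms.ToTensor(),
])

# Optimizer as stated in the card: AdamW with a learning rate of 1e-7.
# `model` is a stand-in module; the real run trained BridgeTower on
# 512 Gaudis with a global batch size of 2048 for 10 epochs.
model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-7)
```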
 
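For the image retrieval task referenced in Table 5, a minimal inference sketch using the BridgeTower classes in `transformers` follows; the checkpoint id and the example image URL are assumptions for illustration, not taken from this card.

```python
import requests
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

# Checkpoint id is an assumption; substitute the checkpoint from this repo.
ckpt = "BridgeTower/bridgetower-base-itm-mlm"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(ckpt)

# Example image (a COCO validation image; the URL is an assumption).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Rank candidate captions by the image-text matching (ITM) head's "match" logit.
for text in ["two cats sleeping on a couch", "a crowd at a football match"]:
    encoding = processor(image, text, return_tensors="pt")
    outputs = model(**encoding)
    score = outputs.logits[0, 1].item()  # logit for the positive (match) class
    print(f"{score:.2f}  {text}")
```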