Update README.md
## Training data

The BridgeTower model was pretrained on five public image-caption datasets:
- [Conceptual Captions (CC3M)](https://ai.google.com/research/ConceptualCaptions/)
- [Conceptual 12M (CC12M)](https://github.com/google-research-datasets/conceptual-12m)
- [SBU Captions](https://www.cs.rice.edu/~vo9/sbucaptions/)
- [MSCOCO Captions](https://arxiv.org/pdf/1504.00325.pdf)
- [Visual Genome](https://visualgenome.org/)

The total number of unique images in the combined data is around 16M.
## Training procedure

### Pretraining

The model was pre-trained for 10 epochs on an Intel AI supercomputing cluster using 512 Gaudis and 128 Xeons with a batch size of 2048.
The optimizer used was AdamW with a learning rate of 1e-7. No data augmentation was used except for center-crop. The image resolution in pre-training is set to 294 x 294.
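The optimizer setup described above can be sketched in PyTorch. This is an illustration only, not the authors' training code: the tiny linear layer is a placeholder for the BridgeTower model, and the batch is reduced far below the card's 2048.

```python
import torch

# Placeholder standing in for BridgeTower; the real model is a
# vision-language transformer, not a single linear layer.
model = torch.nn.Linear(16, 4)

# AdamW with the learning rate stated in the card (1e-7).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-7)

# One illustrative optimization step on random data; the actual
# pre-training used a batch size of 2048 across 512 Gaudis.
batch = torch.randn(8, 16)
loss = model(batch).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```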
## Evaluation results

Please refer to [Table 5](https://arxiv.org/pdf/2206.08657.pdf) for BridgeTower's performance on Image Retrieval and other downstream tasks.
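For context on the retrieval numbers in that table, recall@K can be computed from a matrix of image-text matching scores as follows. This is an illustrative sketch with made-up scores, not the authors' evaluation code.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, k: int) -> float:
    """scores[i, j] = matching score of text query i against image j;
    the ground-truth image for query i is assumed to be image i."""
    # Rank images for each query by descending score.
    ranks = np.argsort(-scores, axis=1)
    # A query is a hit if its ground-truth image appears in the top k.
    hits = (ranks[:, :k] == np.arange(len(scores))[:, None]).any(axis=1)
    return float(hits.mean())

scores = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.7, 0.6],  # query 2's true image is only ranked third
])
r1 = recall_at_k(scores, 1)  # 2 of 3 queries rank their image first
```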
### BibTeX entry and citation info
```bibtex
@article{xu2022bridge,
  title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
  author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
  journal={arXiv preprint arXiv:2206.08657},
  year={2022}
}
```