shaoyent committed
Commit 855785f
1 Parent(s): ba03f0c

Update README.md

Files changed (1)
  1. README.md +12 -15
README.md CHANGED
@@ -82,19 +82,20 @@ print(results)
 ## Training data
 
-The BridgeTower model was pretrained on four public image-caption datasets:
-- [Conceptual Captions(CC)](https://ai.google.com/research/ConceptualCaptions/),
-- [SBU Captions](https://www.cs.rice.edu/~vo9/sbucaptions/),
-- [MSCOCO Captions](https://arxiv.org/pdf/1504.00325.pdf),
+The BridgeTower model was pretrained on five public image-caption datasets:
+- [Conceptual Captions (CC3M)](https://ai.google.com/research/ConceptualCaptions/)
+- [Conceptual 12M (CC12M)](https://github.com/google-research-datasets/conceptual-12m)
+- [SBU Captions](https://www.cs.rice.edu/~vo9/sbucaptions/)
+- [MSCOCO Captions](https://arxiv.org/pdf/1504.00325.pdf)
 - [Visual Genome](https://visualgenome.org/)
 
-The total number of unique images in the combined data is 4M.
+The total number of unique images in the combined data is around 16M.
 
 ## Training procedure
 
 ### Pretraining
 
-The model was pre-trained for ___ steps on an "Intel AI supercomputing cluster" using 512 Gaudis and 128 Xeons with a batch size of 4096.
-The optimizer used was AdamW with a learning rate of 1e-5. No data augmentation was used except for center-crop. The image resolution in pre-training is set to 288 x 288.
+The model was pre-trained for 10 epochs on an Intel AI supercomputing cluster using 512 Gaudis and 128 Xeons with a batch size of 2048.
+The optimizer used was AdamW with a learning rate of 1e-7. No data augmentation was used except for center-crop. The image resolution in pre-training is set to 294 x 294.
 
 ## Evaluation results
 Please refer to [Table 5](https://arxiv.org/pdf/2206.08657.pdf) for BridgeTower's performance on Image Retrieval and other downstream tasks.
@@ -102,13 +103,9 @@ Please refer to [Table 5](https://arxiv.org/pdf/2206.08657.pdf) for BridgeTower'
 ### BibTeX entry and citation info
 ```bibtex
 @article{xu2022bridge,
-  title={Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning},
-  author={Xu, Xiao and
-          Wu, Chenfei and
-          Rosenman, Shachar and
-          Lal, Vasudev and
-          Duan, Nan},
-  journal={arXiv preprint arXiv:2206.08657},
-  year={2022}
+  title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
+  author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
+  journal={arXiv preprint arXiv:2206.08657},
+  year={2022}
 }
 ```
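To make the updated pretraining recipe concrete, here is a minimal PyTorch sketch of the settings the card states (AdamW at 1e-7, no augmentation except center-crop, 294 x 294 inputs). The `Resize` step and the stand-in `model` are illustrative assumptions; the card does not publish the actual training script.

```python
import torch
from torchvision import transforms

# Preprocessing as stated in the card: no augmentation except center-crop,
# at the pre-training resolution of 294 x 294.
# The Resize before the crop is an assumption, not stated in the card.
preprocess = transforms.Compose([
    transforms.Resize(294),
    transforms.CenterCrop(294),
    transforms.ToTensor(),
])

# Optimizer as stated in the card: AdamW with a learning rate of 1e-7.
# `model` is a stand-in module; the real run trained BridgeTower on
# 512 Gaudis with a global batch size of 2048 for 10 epochs.
model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-7)
```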
 
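For the image retrieval task referenced in Table 5, a minimal inference sketch using the BridgeTower classes in `transformers` follows; the checkpoint id and the example image URL are assumptions for illustration, not taken from this card.

```python
import requests
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

# Checkpoint id is an assumption; substitute the checkpoint from this repo.
ckpt = "BridgeTower/bridgetower-base-itm-mlm"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(ckpt)

# Example image (a COCO validation image; the URL is an assumption).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Rank candidate captions by the image-text matching (ITM) head's "match" logit.
for text in ["two cats sleeping on a couch", "a crowd at a football match"]:
    encoding = processor(image, text, return_tensors="pt")
    outputs = model(**encoding)
    score = outputs.logits[0, 1].item()  # logit for the positive (match) class
    print(f"{score:.2f}  {text}")
```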