Training data / resources
I'd like to understand more about how much data DOFA was pretrained with, on what computational resources, and for how long.
- How many datasets was it pretrained on?
- How big was each dataset?
- What GPU model, and how many GPUs, was it trained on?
- How long was it trained for?
The only information related to this that I can find in the paper is:
> To reduce the computational cost of self-supervised training on extensive datasets, we design a continual pretraining strategy inspired by Mendieta et al. [20] incorporating a distillation loss and a weight initialization strategy. This method effectively utilizes knowledge from expansive, supervised, pretrained models, reducing the computational burden and associated CO2 emissions.
Looking at [20] I see:
> 8 NVIDIA V100 GPUs with a batch size of 2048 (128 per GPU) and the image size of 192×192. And the GFM variant uses 93 hours.
Is that the same as, or similar to, what was used for the released DOFA weights? If those numbers carried over directly, that would be on the order of 8 × 93 ≈ 750 V100 GPU-hours, but I don't want to assume. Any more details about the hardware and training procedure that were used would be helpful.
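
In case it helps make the question concrete, below is a minimal PyTorch sketch of what I understand a continual-pretraining step with a frozen supervised teacher, a distillation loss, and teacher-based weight initialization to look like. The tiny encoders, the cosine-similarity loss, and all hyperparameters are placeholders I made up for illustration; none of this is taken from the DOFA code or paper, so please correct me if the actual setup differs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder encoders standing in for the frozen supervised teacher and the
# student being continually pretrained; DOFA's real (ViT-based) architecture
# and wavelength-conditioned patch embedding are not reproduced here.
def tiny_encoder(dim: int = 256) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=8, stride=8),  # coarse patchify
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, dim),
    )

teacher = tiny_encoder().eval()
student = tiny_encoder()
# Crude stand-in for the "weight initialization strategy" mentioned in the quote:
# start the student from the teacher's weights.
student.load_state_dict(teacher.state_dict())
for p in teacher.parameters():
    p.requires_grad = False  # teacher stays frozen; only the student is updated

def distillation_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-distance feature distillation (one common choice; the paper's exact loss may differ)."""
    return 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()

optimizer = torch.optim.AdamW(student.parameters(), lr=1.5e-4)

# Dummy batch at the 192x192 resolution quoted from [20]; the batch size here is arbitrary.
images = torch.randn(8, 3, 192, 192)
with torch.no_grad():
    targets = teacher(images)  # teacher features serve as distillation targets
loss = distillation_loss(student(images), targets)
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```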