How was the total training time decided?
#1, opened by Harveenchadha
How did the BigScience team decide the total number of steps and training time?
@Harveenchadha, this might not fully answer your question, but this page has some details you might find informative: https://bigscience.huggingface.co/blog/model-training-launched
The model does not look converged to me. Both the training and the validation curves suggest that longer training would be beneficial.
The training time corresponds to one full pass over the training corpus, aka one "epoch".
Training for significantly more than 1 epoch (e.g. 2+ full epochs) would take more compute than was available.
In principle, any party that has several servers with A100s can download the model and continue training.
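To make the "one epoch" relation above concrete, here is a rough sketch of how a step count and wall-clock time follow from corpus size and batch shape. The function names and every number below are hypothetical placeholders, not the actual BigScience configuration, and the real run may have counted steps differently (e.g. with a batch-size ramp-up).

```python
def steps_per_epoch(corpus_tokens: int, sequences_per_batch: int, tokens_per_sequence: int) -> int:
    """Optimizer steps needed for one full pass over the corpus (rounded up)."""
    tokens_per_step = sequences_per_batch * tokens_per_sequence
    # Integer ceiling division, so a final partial batch still counts as one step.
    return (corpus_tokens + tokens_per_step - 1) // tokens_per_step


def wall_clock_days(total_steps: int, seconds_per_step: float) -> float:
    """Estimated training time in days at a constant measured step time."""
    return total_steps * seconds_per_step / 86_400


if __name__ == "__main__":
    # Placeholder values chosen only to illustrate the arithmetic.
    one_epoch = steps_per_epoch(
        corpus_tokens=1_000_000,
        sequences_per_batch=10,
        tokens_per_sequence=100,
    )
    print("steps for 1 epoch:", one_epoch)
    print("days for 1 epoch:", wall_clock_days(one_epoch, seconds_per_step=2.0))
    # Two epochs need twice the steps, hence roughly twice the compute.
    print("days for 2 epochs:", wall_clock_days(2 * one_epoch, seconds_per_step=2.0))
```

With a fixed compute budget, the step time and the number of available days cap the total steps, which is why going well past one epoch was not an option here.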