Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
Abstract
Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters, but also because it is prohibitively expensive to perform a hyperparameter search for large language models with billions or trillions of parameters. Recent studies propose using small proxy models and small corpora to perform hyperparameter searches and transferring the optimal parameters to large models and large corpora. While zero-shot transferability has been theoretically and empirically proven for model-size-related hyperparameters, such as depth and width, zero-shot transfer from small corpora to large corpora remains underexplored. In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between these variables and demonstrated its transferability across model sizes. Based on this observation, we propose a new learning rate scheduler, the Power scheduler, that is agnostic to the number of training tokens and batch size. Our experiments show that combining the Power scheduler with Maximum Update Parameterization (muP) consistently achieves impressive performance with one set of hyperparameters, regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve performance comparable to state-of-the-art small language models. We open-source these pretrained models at https://ibm.biz/BdKhLa.
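To make the idea concrete, here is a minimal sketch (in Python) of a warmup, power-law, and final-decay schedule in the spirit of the Power scheduler described above. The functional form lr(step) = min(lr_max, a * step**b), the constants a, b, lr_max, and the length of the final decay are illustrative assumptions for this sketch, not the settings reported in the paper.

```python
# Illustrative sketch of a warmup -> power-law -> final-decay learning rate
# schedule. All constants below (lr_max, a, b, decay_fraction, lr_min_ratio)
# are assumptions for demonstration, not the paper's reported values.

def power_schedule(step, total_steps, lr_max=0.02, a=0.1, b=-0.3,
                   warmup_steps=200, decay_fraction=0.1, lr_min_ratio=0.1):
    """Return the learning rate at a given optimizer step.

    Three phases, mirroring a WSD-style schedule:
      1. linear warmup to lr_max,
      2. power-law phase lr = min(lr_max, a * step**b), which does not
         depend on the total number of training tokens or the batch size,
      3. a short rapid decay over the last `decay_fraction` of training.
    """
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps

    # Power-law phase: bounded above by lr_max.
    lr = min(lr_max, a * step ** b)

    # Final decay toward lr_min_ratio * lr over the tail of training.
    decay_start = int(total_steps * (1.0 - decay_fraction))
    if step >= decay_start:
        progress = (step - decay_start) / max(1, total_steps - decay_start)
        lr *= 1.0 - (1.0 - lr_min_ratio) * progress
    return lr


if __name__ == "__main__":
    total = 10_000
    for s in (0, 500, 1_000, 5_000, 9_500, 10_000):
        print(s, round(power_schedule(s, total), 6))
```

Because the power-law phase depends only on the current step index rather than on the total token budget, the same settings can in principle be reused when the number of training tokens changes, which is the token-number agnosticism the abstract describes.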
Community
Is this the right place for feedback even if the paper was submitted by @akhaliq instead of one of the authors?
In case it is: in Figure 2, if the first plot ranges from lr 0.0002 to 0.0256, you might as well use the same range in the second plot for consistency instead of stopping at 0.0128. The same perplexity range in both plots would also be nice, instead of 40-100 in the first and 40-60 in the second.
The same applies to Figure 4, where panels (a), (b), and (c) range from ~1e-4 to ~5e-6 on the y-axis, while (d) ranges from 8e-5 to 6e-6. Maybe add a grid for better comparability.
Figure 5 as well: if the first plot ranges from 43 to 55 and the second from 44 to 56, you might as well make both 43 to 56 :)
So excited to see such interesting work. But I have a question: in Hypothesis 1 you mention that "we only keep the three best batch sizes to focus on the optimal scenario", and then directly use these three best batch sizes to fit a and b. But in practice, we don't know the optimal batch size in advance. Is the learning rate obtained this way still optimal?