stas posted an update Jul 3
The Universal Checkpointing paper is out! https://arxiv.org/abs/2406.18820

If you remember the BigScience BLOOM-176B training, Tunji Ruwase and I co-invented this technology for Megatron-DeepSpeed so that we could quickly scale the node topology up and down while continuing training.

Since then, the DeepSpeed team has continued to improve it, and it is now fully integrated into DeepSpeed.

The blog post is here: https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ucp/README.md