---
title: "Hugging Face Accelerate: Making device-agnostic ML training and inference easy at scale"
format:
  revealjs:
    theme: moon
    fig-format: png
    auto-play-media: true
---
## Who am I? | |
- Zachary Mueller | |
- Technical Lead for the 🤗 Accelerate project
- Maintain the `transformers` Trainer | |
- API design geek | |
## What is 🤗 Accelerate?
* A training framework | |
* An inference framework | |
* A command-line interface | |
## A Training Framework | |
* Powered by PyTorch | |
* Change a few lines of code, gain device *and* hardware-agnostic capabilities | |
* Low-code, with minimal magic aimed at easy hackability and use without high-level abstractions | |
* We handle the intricacies so you don't have to
## A Training Framework | |
{style="font-size: 70%;"} | |
* Support for any hardware accelerator on the market:
  * CPU, GPU, TPU, XPU, NPU, MLU
* Automatic mixed-precision training, done *safely*, in whichever format you choose:
  * FP16, BF16, FP8 (through either `TransformerEngine` or `MS-AMP`)
* Automatic and efficient gradient accumulation
* Support for quantization through `bitsandbytes`
* Support for your favorite experiment trackers (`aim`, `clearml`, `comet_ml`, `dvclive`, `mlflow`, `tensorboard`, `wandb`)
* An easy-to-configure plugin or YAML-level API for setting up advanced frameworks like `FSDP`, `DeepSpeed`, and `Megatron-LM`
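
To make the gradient accumulation bullet concrete, here is a toy, pure-Python sketch of the idea: the gradient of a mean loss over one large batch can be recovered by averaging the gradients of equally sized micro-batches. This is an illustration only, not Accelerate's implementation; the one-parameter model, the data, and the helper names `grad_mse` and `accumulated_grad` are all made up for the example.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients, scaling each by 1 / num_micro_batches."""
    micro_batches = [
        batch[i : i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    total = 0.0
    for micro_batch in micro_batches:
        # Scaling here plays the role of dividing the loss by the number
        # of accumulation steps before calling backward.
        total += grad_mse(w, micro_batch) / len(micro_batches)
    return total


# Hypothetical data: four (x, y) pairs, split into two micro-batches of two.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full_batch_grad = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
print(full_batch_grad, accumulated)
```

The two values agree up to floating-point error, which is why accumulation lets a small device imitate a larger batch size. The equality relies on the micro-batches having equal size.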
## Low-Code | |
{style="font-size: 70%;"} | |
* The biggest friction with "wrapper" libraries is losing control of your code
* By being minimally intrusive, your code just "works" while you keep complete control
{style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"} | |
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()
```
::: | |
## Easy to integrate | |
{style="font-size: 70%;"} | |
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks: | |
1. Create an `Accelerator` | |
{style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"} | |
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
  device = 'cpu'

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()
```
::: | |
## Easy to integrate | |
{style="font-size: 70%;"} | |
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks: | |
2. Wrap your PyTorch objects with `accelerator.prepare` and remove device-placements | |
{style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"} | |
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator

  accelerator = Accelerator()
- device = 'cpu'

- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
-         source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()
```
::: | |
## Easy to integrate | |
{style="font-size: 70%;"} | |
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks: | |
3. Use `accelerator.backward` for the backward pass | |
{style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"} | |
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator

  accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
  model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()
```
::: | |
## But what about inference? | |
* 🤗 Accelerate is not just for training; it has also helped the GPU-poor take control of the narrative
* With tools like Big Model Inference, users with *tiny* compute can run large models locally
* It started with the Stable Diffusion boom and has since scaled to running huge LLMs locally on a single graphics card
## How does it work? | |
* PyTorch introduced `device="meta"` | |
* 🤗 Accelerate introduced `device_map="auto"`
{style="padding-left:15%;padding-right:20%"} | |
{{< video big_model_visualization.mp4 width="800" height="400" >}} | |
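
As a rough illustration of what an automatic device map has to decide, the sketch below walks the layers of a model in order and places each one on the first device that still has room, spilling from GPU to CPU to disk. It is a deliberately simplified toy, not Accelerate's actual algorithm: the function name `toy_device_map`, the layer names, the sizes (in arbitrary units), and the device labels are all assumptions made for the example.

```python
def toy_device_map(layer_sizes, max_memory):
    """Place layers in order on the first device (by priority) with room left.

    `max_memory` is a dict of device -> capacity, ordered fastest to slowest.
    Once a device is full, later layers never go back to it.
    """
    devices = list(max_memory)  # dicts preserve insertion order
    remaining = dict(max_memory)
    device_map = {}
    current = 0
    for name, size in layer_sizes.items():
        # Move on to the next, slower device while the current one is too full.
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise ValueError(f"No device has room for layer {name!r}")
        device = devices[current]
        remaining[device] -= size
        device_map[name] = device
    return device_map


# Hypothetical model: five layers with made-up sizes.
layers = {"embed": 2, "block.0": 3, "block.1": 3, "block.2": 3, "lm_head": 2}
# Hypothetical budget: a small GPU, some CPU RAM, and unlimited disk.
budget = {"gpu:0": 6, "cpu": 5, "disk": float("inf")}

device_map = toy_device_map(layers, budget)
for layer, device in device_map.items():
    print(f"{layer:8s} -> {device}")
```

With this budget the first two layers land on the GPU, the next one on the CPU, and the rest on disk, which is the same GPU-first, then CPU, then disk spill order that the video visualizes.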
## A CLI Interface | |
* `accelerate config`
  * Configure the environment
* `accelerate launch`
  * How to run your script
## Launching distributed training is hard | |
{style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"} | |
```bash | |
python script.py | |
``` | |
{style="padding-left:50%;padding-bottom:0%;padding-top:0%;"} | |
vs. | |
<br> | |
{style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"} | |
```bash | |
torchrun --nnodes=1 --nproc_per_node=2 script.py | |
``` | |
{style="padding-left:50%;padding-bottom:0%;padding-top:0%;"} | |
vs. | |
<br> | |
{style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"} | |
```bash | |
deepspeed --num_gpus=2 script.py | |
``` | |
<br> | |
How can we make this better? | |
## `accelerate launch` | |
{style="padding-top:0%;padding-left:5%;padding-right:10%;padding-bottom:0%"} | |
```bash | |
accelerate launch script.py | |
``` | |
<br> | |
```bash | |
accelerate launch --multi_gpu --num_processes 2 script.py | |
``` | |
<br> | |
```bash
accelerate launch \
  --multi_gpu \
  --use_deepspeed \
  --num_processes 2 \
  script.py
```
## `accelerate config` | |
* Rely on `config.yaml` files | |
* Either run `accelerate config` or write your own:
{.columns style="font-size: 60%;padding-left:5%;padding-right:5%"} | |
{.column width="40%"} | |
```{.yaml filename=ddp_config.yaml} | |
compute_environment: LOCAL_MACHINE | |
distributed_type: MULTI_GPU | |
main_training_function: main | |
mixed_precision: bf16 | |
num_machines: 1 | |
num_processes: 8 | |
``` | |
{.column width="40%"} | |
```{.yaml filename=fsdp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
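
For readers who prefer to generate such a file from a script, here is a small sketch that writes the flat `ddp_config.yaml` settings shown above to a temporary directory and assembles a launch command that points at it through the `--config_file` option of `accelerate launch`. The helper `to_flat_yaml` is a hypothetical stand-in that only handles flat key/value pairs (a real project would likely use a YAML library), and `script.py` is a placeholder for your own training script. The command is only printed, not executed.

```python
import shlex
import tempfile
from pathlib import Path

# Flat settings copied from the `ddp_config.yaml` example.
ddp_config = {
    "compute_environment": "LOCAL_MACHINE",
    "distributed_type": "MULTI_GPU",
    "main_training_function": "main",
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 8,
}


def to_flat_yaml(settings):
    """Render a flat dict as `key: value` lines (no nesting support)."""
    return "".join(f"{key}: {value}\n" for key, value in settings.items())


# Write the config to a throwaway directory.
config_path = Path(tempfile.mkdtemp()) / "ddp_config.yaml"
config_path.write_text(to_flat_yaml(ddp_config))

# Build (but do not run) the launch command.
command = ["accelerate", "launch", "--config_file", str(config_path), "script.py"]
print(shlex.join(command))
```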
# Now that you're up to speed, what's new? | |
# We've had a busy last year, and so has the ML Community! | |
## New training techniques | |
- Quantization has taken the field by storm | |
- New ideas such as FSDP + QLoRA to train huge models on tiny compute! | |
- New precision backends as we train natively on smaller precision | |
- Optimizing further how much we can push on a single machine through efficient RAM and timing techniques
## Larger compute landscape | |
- As we search for alternatives to NVIDIA, new hardware backends are emerging:
  - XPU (Intel)
  - NPU (Huawei Ascend)
  - MLU (Cambricon)

All of which are supported by 🤗 Accelerate
## Lower abstractions | |
* While the `Accelerator` was great, we needed lower-level abstractions focused on controlling behaviors
* Introduced the `PartialState` | |
{style="padding-left:10%;padding-top:0%;padding-right:15%"} | |
```python
from accelerate import PartialState

if PartialState().is_main_process:
    # Run on only 1 device
    pass

with PartialState().main_process_first():
    # Useful for dataset processing
    pass

# Device-agnostic without the bulk of the `Accelerator`
device = PartialState().device
```
::: | |
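
To show what "main process first" means in practice, here is a toy sketch that uses threads as stand-ins for processes and a barrier for synchronization. Rank 0 runs its block before any other rank is allowed to start, which is the behavior you want when one process should download or preprocess a dataset and the others should reuse the cached result. This is an illustration under those assumptions, not Accelerate's code; `NUM_PROCESSES`, `worker`, and this local `main_process_first` helper are made up for the example.

```python
import threading
from contextlib import contextmanager

NUM_PROCESSES = 4  # hypothetical number of ranks
barrier = threading.Barrier(NUM_PROCESSES)
order = []  # records the order in which ranks run their block
order_lock = threading.Lock()


@contextmanager
def main_process_first(rank):
    """Let rank 0 run the wrapped block first; other ranks wait for it."""
    if rank != 0:
        barrier.wait()  # non-main ranks block here until rank 0 arrives
    yield
    if rank == 0:
        barrier.wait()  # rank 0 is done with its block: release everyone


def worker(rank):
    with main_process_first(rank):
        with order_lock:
            order.append(rank)


threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(NUM_PROCESSES)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(order)  # rank 0 is always first; the remaining order may vary
```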
## Faster and better inference alternatives | |
{style="font-size:70%"} | |
- `PiPPy` gives us efficient pipeline-parallelism in distributed environments to increase throughput while keeping a simple torch-bound API | |
- Rather than having to wait for each GPU, every GPU can be busy in parallel | |
- Will be critical as larger LLMs take hold and more than one computer is needed | |
{style="font-size:60%;padding-left:19%;padding-top:0%;padding-right:24%;"} | |
```python
import torch
from transformers import AutoModelForSequenceClassification

from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model.eval()

input = torch.randint(
    low=0,
    high=model.config.vocab_size,
    size=(2, 1024),  # bs x seq_len
    device="cpu",
)

model = prepare_pippy(model, split_points="auto", example_args=(input,))

with torch.no_grad():
    output = model(input)
```
::: | |
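
The claim that every GPU can be busy in parallel can be checked with a back-of-the-envelope schedule. The sketch below assumes an idealized forward-only pipeline in which each stage takes exactly one clock tick per micro-batch; it does not reproduce PiPPy's real scheduler, and `pipeline_schedule` and the chosen sizes are hypothetical.

```python
def pipeline_schedule(num_stages, num_micro_batches):
    """For each clock tick, list the micro-batch each stage works on (None = idle)."""
    total_ticks = num_stages + num_micro_batches - 1
    schedule = []
    for tick in range(total_ticks):
        row = []
        for stage in range(num_stages):
            # Stage `stage` sees micro-batch `tick - stage`, if it exists.
            micro_batch = tick - stage
            row.append(micro_batch if 0 <= micro_batch < num_micro_batches else None)
        schedule.append(row)
    return schedule


stages, micro_batches = 4, 8  # e.g. 4 GPUs, a batch split into 8 micro-batches
schedule = pipeline_schedule(stages, micro_batches)

# Without pipelining, only one stage is active at a time.
sequential_ticks = stages * micro_batches
pipelined_ticks = len(schedule)
busy_slots = sum(slot is not None for row in schedule for slot in row)
utilization = busy_slots / (stages * pipelined_ticks)

print(f"sequential: {sequential_ticks} ticks, pipelined: {pipelined_ticks} ticks")
print(f"average stage utilization: {utilization:.0%}")
```

Under these assumptions the pipelined run needs `stages + micro_batches - 1` ticks instead of `stages * micro_batches`, and the idle slots are confined to the ramp-up and ramp-down at either end.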
# Adoption: Accelerate in the ecosystem | |
## Accelerate in the Ecosystem | |
* Many of the frameworks you use daily already rely on 🤗 Accelerate!
* Nearly all of 🤗
* `axolotl` | |
* `fastai` | |
* `FastChat` | |
* `lucidrains` | |
* `kornia` | |
## Accelerate in the Ecosystem | |
{style="font-size: 70%;"} | |
- Started as a way to isolate out distributed code on TPU and `DistributedDataParallel`
{style="padding-left: 30%"} | |
![](sylvain_tweet.JPG){width="70%"} | |
## Accelerate in the Ecosystem | |
{style="font-size: 70%;"} | |
- It is now the backbone of some of the largest PyTorch training frameworks in the ecosystem
{style="padding-left: 30%;"} | |
![](hf_trainer.JPG){width="70%"} | |
# What's next? | |
# Elevating the community | |
* Now that more advanced training techniques are within reach (FSDP, DeepSpeed, etc.), we need to focus on educating the community on how to use them best
* This goes beyond how to use the `Trainer` or `Accelerator`; it is about knowing *what* to use *where*
* Keep Accelerate a tool the community can reach for and experiment with when new techniques come out, so new ideas can scale quickly
# 1.0.0: Soon! | |
* Tried and battle-tested by over 7M users/month | 110M+ total downloads | |
* As we've been stable for over a year now, we're nearly ready to release 1.0.0
# Thanks for joining! | |
{style="font-size: 70%;"} | |
- [🤗 Accelerate documentation](https://hf.co/docs/accelerate)
- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)
- [Migrating to 🤗 Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)
- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)
- [DeepSpeed and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)
- [Fully Sharded Data Parallelism and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
- [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed)