Transformers documentation

Multi-GPU inference


Built-in Tensor Parallelism (TP) is now available with certain models when using PyTorch. Tensor parallelism shards a model across multiple GPUs, enabling larger model sizes, and parallelizes computations such as matrix multiplication.

To enable tensor parallelism, pass tp_plan="auto" to from_pretrained():

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Initialize distributed
rank = int(os.environ["RANK"])
device = torch.device(f"cuda:{rank}")
torch.distributed.init_process_group("nccl", device_id=device)

# Retrieve tensor parallel model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    tp_plan="auto",
)

# Prepare input tokens
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Can I help"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Distributed run
outputs = model(inputs)

You can use torchrun to launch the above script with multiple processes, each mapping to a GPU:

torchrun --nproc-per-node 4 demo.py
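
The forward pass above returns raw logits. For text generation you can call generate() on the same sharded model. The following is a minimal sketch that reuses model, tokenizer, inputs, and rank from the script above; max_new_tokens=32 is an arbitrary choice:

# Generate with the tensor-parallel model (every process takes part in the sharded computation)
generated = model.generate(inputs, max_new_tokens=32)

# Each rank ends up with the same output tokens, so decode and print on rank 0 only
if rank == 0:
    print(tokenizer.decode(generated[0], skip_special_tokens=True))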

PyTorch tensor parallelism is currently supported only for certain models, such as the Llama family used in the example above.

You can request tensor parallel support for another model by opening a GitHub Issue or Pull Request.

Expected speedups

You can benefit from considerable inference speedups, especially for inputs with large batch sizes or long sequences.

For example, benchmarks of a single forward pass on Llama with a sequence length of 512 and various batch sizes show that the speedup grows with the batch size.
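
To measure the latency on your own hardware, you can time the distributed forward pass directly. The sketch below assumes the model, tokenizer, device, and rank from the earlier script, and builds a larger batch by simply repeating the prompt; the batch size of 8 and the 10 timed iterations are arbitrary:

import time

# Build a batch of identical prompts (illustrative only)
batch = tokenizer(["Can I help"] * 8, return_tensors="pt").input_ids.to(device)

# Warm-up pass so one-time initialization costs are excluded
with torch.no_grad():
    model(batch)

# Time several forward passes; synchronize so GPU work is included in the measurement
torch.cuda.synchronize(device)
start = time.perf_counter()
for _ in range(10):
    with torch.no_grad():
        model(batch)
torch.cuda.synchronize(device)

if rank == 0:
    print(f"Average forward latency: {(time.perf_counter() - start) / 10 * 1000:.1f} ms")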
